Causal Inference Lecture Notes: Selection Bias in Observational Studies


Kosuke Imai
Department of Politics, Princeton University
April 7, 2008

So far, we have studied how to analyze randomized experiments. Unfortunately, in the vast majority of social science research, randomizing the treatment assignment is impossible. This means that researchers are forced to conduct observational studies rather than experimental studies. The key distinction between the two types of research is whether or not researchers can randomize the treatment assignment. In observational studies, researchers only passively observe the treatment status of each unit, which was assigned in some non-random fashion. Here, we discuss how to analyze such observational data.

1 Randomized and Quasi-randomized Natural Experiments

Before discussing identification and statistical issues, we briefly talk about natural experiments. There are two types of natural experiments: randomized natural experiments and quasi-randomized natural experiments. The first type refers to studies where the randomization of treatment assignment is actually conducted in real-world settings by bureaucrats and other agents rather than by researchers as a part of their study. In these studies, the exact treatment assignment mechanism is random and known to researchers. Here are some examples:

- Vietnam draft lottery (Angrist, 1990)
- Study of lottery players (Imbens et al., 2001)
- California alphabet lottery (Ho and Imai, 2006, 2008)
- Random assignment of roommates in college (Sacerdote, 2001)

In these studies, the random treatment assignment mechanism is known exactly to researchers and yet is not conducted by the researchers themselves as an intervention. This means that one can analyze these data using the design-based approaches we have studied. Thus, both external and internal validity are ensured. Unfortunately, only a small number of such randomized natural experiments exist in the real world.

In contrast, quasi-randomized natural experiments refer to studies where the exact treatment assignment mechanism is unknown to researchers but is plausibly random. In these studies, researchers assume that the treatment is randomly assigned and conduct statistical analyses based on this assumption. In principle, the assumption cannot be directly tested, although testable implications can be derived in some studies. This means that the assumption is more credible in some studies than in others. Another important point is that, since the treatment assignment mechanism is unknown, design-based analysis is impossible and researchers must rely on modeling assumptions that are not directly testable.

There are many more quasi-randomized natural experiments than randomized natural experiments. In fact, almost all observational studies fit into this category, except that some are more plausible than others. Here are some examples:

- Quarter of birth (Angrist and Krueger, 1991)
- Close elections (Lee et al., 2004)
- Weather (Miguel et al., 2004)
- Geography (Hoxby, 2000)
- Identical twins (Ashenfelter and Krueger, 1994)
- Disease outbreak (Acemoglu et al., 2001)

Indeed, journals such as American Economic Review, Journal of Political Economy, and Quarterly Journal of Economics devote most of their pages to quasi-randomized natural experiments of this sort. When evaluating these studies, one needs to pay attention to both internal validity (Are the statistical assumptions credible?) and external validity (What are the causal estimands? To what population are the results generalizable?). In addition, it is important to assess the sensitivity of the conclusions to violations of the maintained assumptions.
2 Causal Inference Under Exogeneity

2.1 Assumptions

An alternative assumption in observational studies is exogeneity. This assumption is behind all regression-based inference and can be decomposed into two parts. The first part is called unconfoundedness and is defined formally as,

$$(Y_i(1), Y_i(0)) \perp\!\!\!\perp T_i \mid X_i = x \quad \text{for all } x. \tag{1}$$

This assumption is also called no omitted variable bias, ignorability, or selection on observables. It implies that, conditional on observed pre-treatment covariates, the treatment assignment is independent of potential outcomes. In other words, we assume that the treatment assignment is randomized after conditioning on a sufficient set of covariates. Note that this assumption cannot be directly tested from the data.

The second part of the exogeneity assumption is called overlap and is defined as,

$$0 < \Pr(T_i = 1 \mid X_i = x) < 1 \quad \text{for all } x. \tag{2}$$

This assumption ensures that all units in the population have a non-zero probability of being assigned to both the treatment and the control group. That is, at every value of the covariates there are control units that can be used to infer the unobserved counterfactual outcomes of the treated units.

Under these two assumptions, the average treatment effect is identified,

$$E(Y_i(1) - Y_i(0)) = E\{E(Y_i \mid T_i = 1, X_i) - E(Y_i \mid T_i = 0, X_i)\}, \tag{3}$$

where $E(Y_i \mid T_i = t, X_i)$ can be estimated using regression for $t = 0, 1$. As we discussed earlier in the course, when averaging over a multidimensional vector $X_i$ is difficult, one may focus on the conditional average treatment effect by averaging over the sample distribution of $X_i$,

$$\frac{1}{n} \sum_{i=1}^{n} E(Y_i(1) - Y_i(0) \mid X_i). \tag{4}$$

The central difficulty of the exogeneity assumption is that one typically needs to collect a large number of pre-treatment covariates in order to make the unconfoundedness assumption credible. However, a high-dimensional $X_i$ makes regression modeling a challenging task. Ideally, one would like to apply nonparametric regression techniques in order to model $E(Y_i \mid T_i, X_i)$ without imposing a parametric structure. However, the amount of data required by such nonparametric regressions increases exponentially with the number of covariates. This is known as the curse of dimensionality.
In addition, many nonparametric regression techniques require users to choose the values of smoothing parameters, and the results can be sensitive to this choice. For this and other theoretical and computational reasons, nonparametric regressions have rarely been used by social scientists. The situation could change in the near future, however, as the theoretical literature on nonparametric regression is growing fast.

2.2 Matching

Given the practical difficulty of nonparametric regression techniques, matching methods have become a popular choice for social scientists who recognize the fragility of parametric regression modeling. The idea is very simple. Under the exogeneity assumption, matching methods try to reduce the observed covariate differences between the treatment and control groups.
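With a single discrete covariate, this idea reduces to comparing treated and control units within each covariate stratum, which is also the sample analogue of equation 3. The following is a minimal sketch under that assumption. The function names and the toy data are hypothetical and are not part of the original notes.

```python
from collections import defaultdict


def stratified_ate(x, t, y):
    """Sample analogue of equation 3 for one discrete covariate.

    Within each stratum of x, take the treated-minus-control difference in
    mean outcomes, then average these differences over the sample
    distribution of x.
    """
    strata = defaultdict(lambda: {0: [], 1: []})
    for xi, ti, yi in zip(x, t, y):
        strata[xi][ti].append(yi)

    n = len(x)
    estimate = 0.0
    for value, arms in strata.items():
        if not arms[0] or not arms[1]:
            # Overlap (equation 2) fails in the sample for this stratum.
            raise ValueError(f"no overlap in stratum {value!r}")
        diff = sum(arms[1]) / len(arms[1]) - sum(arms[0]) / len(arms[0])
        estimate += diff * (len(arms[0]) + len(arms[1])) / n
    return estimate


def naive_difference(t, y):
    """Unadjusted difference in mean outcomes between treated and control."""
    treated = [yi for ti, yi in zip(t, y) if ti == 1]
    control = [yi for ti, yi in zip(t, y) if ti == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


# Toy data (hypothetical): treatment is far more likely when x = 1,
# and the outcome is y = 1 + 2x + 3t, so the true effect is 3.
x = [0] * 10 + [1] * 10
t = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2
y = [1 + 2 * xi + 3 * ti for xi, ti in zip(x, t)]

print(naive_difference(t, y))   # confounded by x
print(stratified_ate(x, t, y))  # compares like with like within strata
```

In this noise-free example the unadjusted difference is about 4.2, while the stratified estimate recovers the effect of 3. With many covariates the number of strata grows exponentially, which is the curse of dimensionality in this setting.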

To motivate matching methods, we consider the following bias decomposition due to Heckman et al. (1998). In particular, we consider the estimation of the (conditional) average treatment effect for the treated,

$$\frac{1}{\sum_{i=1}^{n} T_i} \sum_{i \in \{i : T_i = 1\}} E(Y_i(1) - Y_i(0) \mid X_i). \tag{5}$$

If we focus on the bias that arises in estimating $\frac{1}{\sum_{i=1}^{n} T_i} \sum_{i \in \{i : T_i = 1\}} E(Y_i(0) \mid X_i)$, the bias can be written as,

$$
\begin{aligned}
& \int_{S_1} E(Y_i(0) \mid T_i = 1, X_i)\, dF(X_i \mid T_i = 1) - \int_{S_0} E(Y_i(0) \mid T_i = 0, X_i)\, dF(X_i \mid T_i = 0) \\
={} & \int_{S_1 \setminus S} E(Y_i(0) \mid T_i = 1, X_i)\, dF(X_i \mid T_i = 1) - \int_{S_0 \setminus S} E(Y_i(0) \mid T_i = 0, X_i)\, dF(X_i \mid T_i = 0) \\
& + \int_{S} \{E(Y_i(0) \mid T_i = 1, X_i) - E(Y_i(0) \mid T_i = 0, X_i)\}\, dF(X_i \mid T_i = 1) \\
& + \int_{S} E(Y_i(0) \mid T_i = 0, X_i)\, d\{F(X_i \mid T_i = 1) - F(X_i \mid T_i = 0)\},
\end{aligned}
$$

where $S_1$ and $S_0$ are the empirical supports of $X_i$ for the treatment and control groups, and $S = S_1 \cap S_0$. The first term corresponds to the extrapolation bias, which is zero if the overlap assumption is satisfied. The second term is the omitted variable bias, which may remain even after conditioning on the observed pre-treatment variables. Finally, the third term is the bias due to imbalance (within the common support) of the observed pre-treatment covariates. Matching methods seek to minimize this third component of bias by reducing sample imbalance, that is, by making

$$F(X_i \mid T_i = 1) = F(X_i \mid T_i = 0). \tag{6}$$

Of course, even with matching methods, one cannot escape from the curse of dimensionality. To make equation 6 hold exactly, we need to do exact matching, where each treated unit is matched with a control unit that has exactly the same values for all covariates. This may be easy when there is one covariate, but becomes more difficult as the number of covariates increases. In such multivariate matching settings, one must rely on distance measures. A common distance measure is the Mahalanobis distance, which is defined as,

$$(X_i - X_j)^\top \Sigma^{-1} (X_i - X_j), \tag{7}$$

where $\Sigma$ is the sample covariance matrix.
In particular, Mahalanobis matching refers to the procedure where one computes the Mahalanobis distance measure from the sample mean, i.e., sets $X_j = \sum_{i=1}^{n} X_i / n$, for each observation and creates pairs based on that measure. The Mahalanobis distance measure is a reasonable metric but is, like any other such measure, arbitrary. In particular, it weights every variable equally, which may not be appropriate since some pre-treatment covariates are more important than others. To partially overcome this problem, we next turn to another measure, called the propensity score.
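As a rough illustration of equation 7, the sketch below matches each treated unit to its nearest control unit, with replacement, using the pairwise Mahalanobis distance for two covariates. This nearest-neighbor variant is a common one and is an assumption here; it is not necessarily the exact procedure described above. The helper names and the toy covariates (age in years, income in thousands) are hypothetical.

```python
def sample_covariance(rows):
    """Sample covariance of (x1, x2) pairs, returned as (s11, s12, s22)."""
    n = len(rows)
    m1 = sum(r[0] for r in rows) / n
    m2 = sum(r[1] for r in rows) / n
    s11 = sum((r[0] - m1) ** 2 for r in rows) / (n - 1)
    s22 = sum((r[1] - m2) ** 2 for r in rows) / (n - 1)
    s12 = sum((r[0] - m1) * (r[1] - m2) for r in rows) / (n - 1)
    return s11, s12, s22


def mahalanobis_sq(u, v, cov):
    """(u - v)' Sigma^{-1} (u - v) for two covariates, as in equation 7."""
    s11, s12, s22 = cov
    det = s11 * s22 - s12 ** 2
    d1, d2 = u[0] - v[0], u[1] - v[1]
    # Closed-form inverse of the 2x2 covariance matrix.
    return (s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det


def nearest_neighbor_match(treated, controls, cov):
    """Index of the closest control unit for each treated unit."""
    return [
        min(range(len(controls)),
            key=lambda j: mahalanobis_sq(u, controls[j], cov))
        for u in treated
    ]


# Toy covariates (hypothetical): (age, income in thousands).
treated = [(30.0, 50.0), (45.0, 80.0)]
controls = [(31.0, 52.0), (44.0, 78.0), (30.0, 90.0), (60.0, 50.0)]

cov = sample_covariance(treated + controls)
matches = nearest_neighbor_match(treated, controls, cov)
print(matches)
```

Because the distance is scaled by the sample covariance matrix, changing the units of a covariate (for example, recording income in dollars instead of thousands) leaves the matched pairs unchanged.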

2.3 Propensity Score

The propensity score is the conditional probability of receiving the treatment given the pre-treatment covariates. Formally, it is defined as,

$$\pi(X_i) \equiv \Pr(T_i = 1 \mid X_i). \tag{8}$$

In a very influential paper, Rosenbaum and Rubin (1983b) first show that the propensity score has the so-called balancing property,

$$T_i \perp\!\!\!\perp X_i \mid \pi(X_i). \tag{9}$$

Next, they prove that under exogeneity the treatment assignment is conditionally independent of potential outcomes given the propensity score alone,

$$(Y_i(1), Y_i(0)) \perp\!\!\!\perp T_i \mid \pi(X_i). \tag{10}$$

This property suggests that one may conduct matching based on the propensity score rather than on a multidimensional vector $X_i$. Propensity score matching essentially weights different covariates according to their relative importance in predicting receipt of the treatment. The result in equation 10 makes intuitive sense because, in randomized experiments, the knowledge of $\pi(X_i)$ for each unit $i$ is what allows researchers to identify the average treatment effect.

For non-binary treatment regimes, Imai and van Dyk (2004) prove similar theorems for what they call the propensity function. They suggest an analysis based on subclassification on the propensity function. Here, however, we focus on the case of binary treatment variables.

Of course, in observational studies, the propensity score is unknown and therefore must be estimated from the data. Here again, researchers face the curse of dimensionality: $X_i$ may be a high-dimensional vector, and modeling $\pi(X_i)$ nonparametrically will be perhaps as difficult as modeling $E(Y_i \mid T_i, X_i)$ directly. It then appears that the propensity score has not given us any advantage at all! However, there are at least two reasons why the propensity score might be useful.

First, the propensity score comes with a direct diagnostic about model specification. Recall that the true propensity score satisfies the so-called balancing property in equation 9.
This means that if your propensity score is a good estimate, then the balancing property must be satisfied. That is, after matching on the propensity score, one should achieve good covariate balance between the treated and control units in the matched sample, i.e., $F(X_i \mid T_i = 1) \approx F(X_i \mid T_i = 0)$. What does this mean? Well, the propensity score works when it is correct! Ho, Imai, King, and Stuart (2007) call this the propensity score tautology and suggest that researchers may use the propensity score for matching, but that they should adjust the propensity score model specification until good balance is achieved in the resulting matched sample. From this perspective, the propensity score offers a potentially useful technology for conducting multivariate matching.
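The balancing property in equation 9 can be checked by hand in a toy example. The sketch below assumes a single discrete covariate, so the propensity score can be computed from cell frequencies rather than from a fitted model. The function names and the counts are hypothetical. Exact fractions are used so that the comparison is not affected by floating-point rounding.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def propensity_by_cell(x, t):
    """Share of treated units within each value of the covariate."""
    totals = Counter(x)
    treated = Counter(xi for xi, ti in zip(x, t) if ti == 1)
    return {value: Fraction(treated[value], totals[value]) for value in totals}


def marginal_shares(x, t, arm):
    """Distribution of the covariate within one treatment arm."""
    counts = Counter(xi for xi, ti in zip(x, t) if ti == arm)
    total = sum(counts.values())
    return {value: Fraction(c, total) for value, c in counts.items()}


def shares_within_score(x, t, score):
    """Distribution of the covariate by arm, within each score stratum."""
    counts = defaultdict(lambda: {0: Counter(), 1: Counter()})
    for xi, ti in zip(x, t):
        counts[score[xi]][ti][xi] += 1
    return {
        s: {
            arm: {v: Fraction(c, sum(cnt.values())) for v, c in cnt.items()}
            for arm, cnt in arms.items()
        }
        for s, arms in counts.items()
    }


# Hypothetical counts: covariate value -> (number of units, number treated).
cells = {"a": (100, 20), "b": (50, 10), "c": (40, 24), "d": (10, 6)}
x, t = [], []
for value, (n_units, n_treated) in cells.items():
    x += [value] * n_units
    t += [1] * n_treated + [0] * (n_units - n_treated)

score = propensity_by_cell(x, t)            # a, b -> 1/5; c, d -> 3/5
print(marginal_shares(x, t, 1))             # treated group, unadjusted
print(marginal_shares(x, t, 0))             # control group, unadjusted
shares = shares_within_score(x, t, score)
for s, arms in shares.items():
    print(s, arms[1] == arms[0])            # balance within score strata
```

The covariate distributions of the two groups differ overall but coincide within each propensity score stratum. With an estimated score this equality holds only approximately, which is why balance in the matched sample serves as the specification check described above.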

The second justification for propensity score methods is the so-called double-robustness property. To understand the class of doubly-robust estimators, we start with the idea of propensity score weighting, which is based on the classic estimator proposed by Horvitz and Thompson (1952) for survey sampling,

$$\hat\tau_{HT} \equiv \frac{1}{n} \sum_{i=1}^{n} \frac{T_i Y_i}{\hat\pi(X_i)} - \frac{1}{n} \sum_{i=1}^{n} \frac{(1 - T_i) Y_i}{1 - \hat\pi(X_i)}. \tag{11}$$

Note that this generalizes Neyman's estimator to the situation with heterogeneous treatment assignment probabilities. If $\hat\pi(X_i)$ is consistently estimated, then $\hat\tau_{HT}$ consistently estimates the ATE. The problem, however, is that $\hat\pi(X_i)$ may not be consistently estimated. The same problem arises in the approach that directly models the outcome variable without using propensity scores. Consider a linear regression model,

$$Y_i(T_i) = \alpha + \beta T_i + \gamma^\top X_i + \epsilon_i. \tag{12}$$

The least squares estimate of $\beta$ is consistent for the ATE if the model is correct, but may be quite far from the truth if the modeling assumptions are incorrect.

One possible solution to this problem is to combine the direct modeling approach with propensity score weighting methods. In particular, consider the following combined estimator,

$$
\begin{aligned}
\hat\tau_C \equiv{} & \left\{ \frac{1}{n} \sum_{i=1}^{n} (\hat\alpha + \hat\beta + \hat\gamma^\top X_i) + \frac{1}{n} \sum_{i=1}^{n} \frac{T_i}{\hat\pi(X_i)} (Y_i - \hat\alpha - \hat\beta - \hat\gamma^\top X_i) \right\} \\
& - \left\{ \frac{1}{n} \sum_{i=1}^{n} (\hat\alpha + \hat\gamma^\top X_i) + \frac{1}{n} \sum_{i=1}^{n} \frac{1 - T_i}{1 - \hat\pi(X_i)} (Y_i - \hat\alpha - \hat\gamma^\top X_i) \right\}.
\end{aligned}
\tag{13}
$$

It is immediate that if the propensity score model is correct, then $\hat\tau_C$ reduces to the propensity score weighting estimator $\hat\tau_{HT}$ in a large sample, regardless of whether or not the outcome model is correct. On the other hand, if the outcome model is correct, then it reduces to the least squares estimator in a large sample. Thus, $\hat\tau_C$ is a doubly-robust estimator, which consistently estimates the ATE so long as either the outcome model or the propensity score model is correct.
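The double-robustness argument can be illustrated numerically. The sketch below implements the weighting estimator of equation 11 and a combined estimator in which the fitted outcome predictions are passed in as plain lists. Equation 13 is the special case where these predictions come from the linear regression in equation 12. The function names and the noise-free toy data are hypothetical, and the exact recovery of the true effect is a feature of this constructed example rather than a finite-sample guarantee.

```python
def ht_estimate(y, t, pi_hat):
    """Propensity score weighting estimator of equation 11."""
    n = len(y)
    treated = sum(ti * yi / pi for yi, ti, pi in zip(y, t, pi_hat)) / n
    control = sum((1 - ti) * yi / (1 - pi)
                  for yi, ti, pi in zip(y, t, pi_hat)) / n
    return treated - control


def combined_estimate(y, t, pi_hat, m1_hat, m0_hat):
    """Outcome predictions plus inverse-probability-weighted residuals."""
    n = len(y)
    mu1 = sum(m1 + ti * (yi - m1) / pi
              for yi, ti, pi, m1 in zip(y, t, pi_hat, m1_hat)) / n
    mu0 = sum(m0 + (1 - ti) * (yi - m0) / (1 - pi)
              for yi, ti, pi, m0 in zip(y, t, pi_hat, m0_hat)) / n
    return mu1 - mu0


# Toy data (hypothetical): y = 1 + 2x + 3t, so the true ATE is 3.
# Treatment probabilities are 0.2 when x = 0 and 0.5 when x = 1.
x = [0] * 10 + [1] * 10
t = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5
y = [1 + 2 * xi + 3 * ti for xi, ti in zip(x, t)]

pi_right = [0.2 if xi == 0 else 0.5 for xi in x]
pi_wrong = [0.5] * len(x)               # ignores x entirely
m1_right = [4 + 2 * xi for xi in x]     # E(Y(1) | X)
m0_right = [1 + 2 * xi for xi in x]     # E(Y(0) | X)
m_wrong = [0.0] * len(x)                # a useless outcome model

print(ht_estimate(y, t, pi_right))
print(combined_estimate(y, t, pi_right, m_wrong, m_wrong))
print(combined_estimate(y, t, pi_wrong, m1_right, m0_right))
print(combined_estimate(y, t, pi_wrong, m_wrong, m_wrong))
```

Up to floating-point rounding, the first three estimates equal 3, because at least one of the two models is correct in each case. The last estimate, with both models wrong, is 1.5.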
In general, a typical doubly robust estimator can be written as (Robins et al., 1994),

$$
\hat\tau_{DR} \equiv \left\{ \frac{1}{n} \sum_{i=1}^{n} \hat m_1(X_i) + \frac{1}{n} \sum_{i=1}^{n} \frac{T_i (Y_i - \hat m_1(X_i))}{\hat\pi(X_i)} \right\} - \left\{ \frac{1}{n} \sum_{i=1}^{n} \hat m_0(X_i) + \frac{1}{n} \sum_{i=1}^{n} \frac{(1 - T_i)(Y_i - \hat m_0(X_i))}{1 - \hat\pi(X_i)} \right\},
\tag{14}
$$

where $m_t(X_i) = E(Y_i(t) \mid X_i)$ for $t = 0, 1$. It has been shown that this class of estimators achieves the semiparametric efficiency bound if both the propensity score and outcome models are correct. Recent methodological work in this area explores doubly robust estimators that are efficient whenever the propensity score model is correctly specified but the outcome model is not (e.g., Tan, 2006).

Propensity score weighting is more efficient than propensity score matching because it incorporates all the information about the propensity score. However, this means that the results are more likely to be sensitive to the way the weights are estimated. In particular, it is known that when some estimated propensity scores are close to zero or one, so that the inverse-probability weights are very large, the weighting estimator has a high variance. A forthcoming issue of Statistical Science has a debate on this specific issue (Kang and Schafer, In-press; Robins et al., In-press).

2.4 The Art of Matching

Whether or not one uses a specific form of doubly robust estimator, one can think of matching methods as a nonparametric preprocessing procedure that helps reduce parametric model dependence (Ho, Imai, King, and Stuart, 2007). By making the treatment and control groups similar to each other in terms of their covariate distributions, matching methods reduce the need for extreme interpolation and extrapolation. Thus, one should not think of matching methods as a replacement for parametric regression modeling. Instead, matching methods can be used to make parametric models robust. In most studies, exact matching is impossible, and so matching does not eliminate the observed differences. In this case, parametric adjustments are essential for the remaining differences.

Just like regression modeling, matching involves art as well as science. Indeed, there are many different matching methods. Matching can be done with or without replacement. In addition to pair matching, there are a variety of other methods. Ratio matching, or one-to-many matching, matches each treated unit with several control units. Caliper matching places restrictions on which observations can be matched. Subclassification divides the sample into several groups of similar observations. Optimal full matching uses a varying number of subclasses of different sizes in order to maximize a pre-defined measure of balance. Genetic matching takes this idea further. The problem of choosing among different matching algorithms is not trivial.
Since the ultimate goal of matching is to maximize the resulting balance, in principle it does not matter which matching procedure one uses. The problem, however, is that there is no single measure of balance. Some use balance tests such as t-tests, but these have the obvious problem that changing the sample size changes the measure of balance even though the balance itself may not have changed at all (Imai, King, and Stuart, 2008). Another issue is how to compare different balance measures for the same variable (e.g., mean difference vs. variance ratio) or the same measure across different variables. These problems are not going to disappear in the near future, as no objective consensus is likely to emerge. My own personal take is to rely on one's substantive knowledge, at least to some extent. If you know that some variables are important, then match exactly on them and use the propensity score to adjust for the other variables. See Ho, Imai, King, and Stuart (2007) for a detailed discussion of matching methods and references to the statistics literature.
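The problem with balance tests can be seen in a few lines. The sketch below compares a covariate between two groups, first in a small sample and then in a sample that replicates the same values ten times, so the empirical distributions and hence the balance are identical. The function names and data are hypothetical, and the Welch form of the t-statistic is an assumption made for concreteness.

```python
from math import sqrt
from statistics import mean, variance


def mean_difference(x_treated, x_control):
    """Difference in covariate means between the two groups."""
    return mean(x_treated) - mean(x_control)


def variance_ratio(x_treated, x_control):
    """Ratio of sample variances, another common balance measure."""
    return variance(x_treated) / variance(x_control)


def t_statistic(x_treated, x_control):
    """Welch two-sample t-statistic for the difference in means."""
    se = sqrt(variance(x_treated) / len(x_treated)
              + variance(x_control) / len(x_control))
    return mean_difference(x_treated, x_control) / se


# Hypothetical covariate values for a small matched sample.
small_treated = [1, 2, 3, 4]
small_control = [2, 3, 4, 5]

# Same empirical distributions, ten times as many observations.
big_treated = small_treated * 10
big_control = small_control * 10

for xt, xc in [(small_treated, small_control), (big_treated, big_control)]:
    print(len(xt), mean_difference(xt, xc), t_statistic(xt, xc))
```

The mean difference is the same in both samples, yet the t-statistic moves from well inside to well outside the conventional 1.96 threshold. Read the other way, discarding observations can make a t-test report improved balance when nothing about the covariate distributions has changed.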

3 Bounding the Average Treatment Effect

In many situations, the exogeneity assumption may not be credible. Thus, one may conduct a nonparametric identification analysis without such a strong assumption. Manski (2007) and others have taken this approach. First, Manski (1990) and Robins (1989) establish the following no-assumption bounds. Suppose that the outcome variables are bounded, $-\infty < a_t \le Y_i(t) \le b_t < \infty$ for $t = 0, 1$. Then, the no-assumption bounds on the average treatment effect at $X_i$, i.e., $\tau(X_i)$, are,

$$
\begin{aligned}
\tau(X_i) \in \big[\, & a_1 \Pr(T_i = 0 \mid X_i) + E(Y_i \mid T_i = 1, X_i) \Pr(T_i = 1 \mid X_i) \\
& \quad - E(Y_i \mid T_i = 0, X_i) \Pr(T_i = 0 \mid X_i) - b_0 \Pr(T_i = 1 \mid X_i), \\
& b_1 \Pr(T_i = 0 \mid X_i) + E(Y_i \mid T_i = 1, X_i) \Pr(T_i = 1 \mid X_i) \\
& \quad - E(Y_i \mid T_i = 0, X_i) \Pr(T_i = 0 \mid X_i) - a_0 \Pr(T_i = 1 \mid X_i) \,\big].
\end{aligned}
\tag{15}
$$

The width of the bounds is $(b_1 - a_1) \Pr(T_i = 0 \mid X_i) + (b_0 - a_0) \Pr(T_i = 1 \mid X_i)$, which generally depends on the treatment assignment probability. If, however, we have $a = a_0 = a_1$ and $b = b_0 = b_1$, then the width is given by $b - a$, which no longer depends on the treatment assignment probability. In particular, this is half the width of the original bounds, $[a - b, b - a]$. Thus, the sampling process alone narrows the bounds on the ATE by half.

Similarly, if the outcome of interest is binary, then the bounds can be simplified. Let $A$ be a set specified within the support of the outcome variable. Then, the bounds are given by (dropping the conditioning on $X_i$ for notational simplicity),

$$
\begin{aligned}
\tau \in \big[\, & \Pr(Y_i \in A \mid T_i = 1) \Pr(T_i = 1) - \Pr(Y_i \in A \mid T_i = 0) \Pr(T_i = 0) - \Pr(T_i = 1), \\
& \Pr(T_i = 0) + \Pr(Y_i \in A \mid T_i = 1) \Pr(T_i = 1) - \Pr(Y_i \in A \mid T_i = 0) \Pr(T_i = 0) \,\big].
\end{aligned}
\tag{16}
$$

After obtaining the no-assumption bounds, the next step in the nonparametric identification analysis is to formulate weak assumptions and derive the bounds under them. This analysis allows one to formalize the role of additional assumptions in the identification of the ATE. Manski (1997) considers the following monotone treatment response assumption,

$$Y_i(0) \le Y_i(1). \tag{17}$$

That is, we assume that the treatment does not hurt. Under this assumption, the bounds on the ATE become,

$$\tau \in \big[\, 0, \; b_1 \Pr(T_i = 0) + E(Y_i \mid T_i = 1) \Pr(T_i = 1) - E(Y_i \mid T_i = 0) \Pr(T_i = 0) - a_0 \Pr(T_i = 1) \,\big]. \tag{18}$$

The result is intuitive: the lower bound improves while the upper bound is unaffected. We can generalize this result to the setting with a multi-valued treatment where $T_i \in \mathcal{T} \equiv \{0, 1, \ldots, K - 1\}$. Assume $a \le Y_i \le b$ and $Y_i(s) \le Y_i(t)$ for any $t \ge s$ with $t, s \in \mathcal{T}$. Then, the sharp bounds are given by,

$$0 \le E(Y_i(t) - Y_i(s)) \le b \Pr(T_i < t) + E(Y_i \mid T_i \ge t) \Pr(T_i \ge t) - a \Pr(T_i > s) - E(Y_i \mid T_i \le s) \Pr(T_i \le s), \tag{19}$$

where $t > s$. When the outcome is binary, the bounds take a very simple form,

$$0 \le \Pr(Y_i(t) = 1) - \Pr(Y_i(s) = 1) \le \Pr(Y_i = 0, T_i < t) + \Pr(Y_i = 1, T_i > s). \tag{20}$$

More interesting assumptions of this type occur in settings with multiple and/or multi-valued treatments. Manski (1997) considers the semi-monotone treatment response, where a vector of multi-valued treatments exists and $Y_i(t) \ge Y_i(s)$ if and only if $t_k \ge s_k$ for all elements $k$ of the treatment vectors. The other assumption is that of concave-monotone treatment response, where the potential outcome $Y_i(T_i)$ is assumed to be a concave function of the multi-valued treatment variable $T_i$. Under these assumptions, one can derive the bounds in a manner similar to the ones above. For example, under the semi-monotone treatment response, we have the following bounds when $t \ge s$,

$$
\begin{aligned}
0 \le E(Y_i(t) - Y_i(s)) \le{} & E(Y_i \mid T_i \ge t) \Pr(T_i \ge t) + b \{\Pr(T_i < t) + \Pr(T_i \parallel t)\} \\
& - E(Y_i \mid T_i \le s) \Pr(T_i \le s) - a \{\Pr(T_i > s) + \Pr(T_i \parallel s)\},
\end{aligned}
\tag{21}
$$

where $t \parallel s$ means that the pair is not ordered. When $t \parallel s$, the lower bound becomes,

$$
E(Y_i \mid T_i \le t) \Pr(T_i \le t) + a \{\Pr(T_i > t) + \Pr(T_i \parallel t)\} - E(Y_i \mid T_i \ge s) \Pr(T_i \ge s) - b \{\Pr(T_i < s) + \Pr(T_i \parallel s)\}.
\tag{22}
$$

Finally, yet another assumption is monotone treatment selection, which can be written as,

$$T_i = t \implies Y_i(t) \ge Y_i(1 - t), \tag{23}$$

for $t = 0, 1$. For example, a super doctor may give the treatment to patients only when it works. Under this assumption, the bounds become,

$$
\begin{aligned}
\tau(X_i) \in \big[\, & E(Y_i \mid T_i = 1, X_i) \Pr(T_i = 1 \mid X_i) + a \Pr(T_i = 0 \mid X_i) - E(Y_i \mid X_i), \\
& E(Y_i \mid X_i) - E(Y_i \mid T_i = 0, X_i) \Pr(T_i = 0 \mid X_i) - a \Pr(T_i = 1 \mid X_i) \,\big].
\end{aligned}
\tag{24}
$$

These identification analyses are useful in situations where the assumptions for point identification are too strong to be credible.

4 Sensitivity Analysis

So far, we have introduced two approaches. The first approach makes the exogeneity assumption and develops robust statistical methods for point identification of causal effects.
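Before moving on, the bounds of Section 3 can be made concrete with a small numerical sketch. It computes the no-assumption bounds of equation 15 and the monotone treatment response bounds of equation 18, without covariates and with common outcome bounds $a$ and $b$. The function names and the toy binary data are hypothetical.

```python
def no_assumption_bounds(y, t, a, b):
    """Bounds on the ATE as in equation 15, with a_0 = a_1 = a, b_0 = b_1 = b."""
    n = len(y)
    p1 = sum(t) / n
    p0 = 1 - p1
    # E(Y | T = 1) Pr(T = 1) and E(Y | T = 0) Pr(T = 0), computed directly.
    ey1_p1 = sum(yi for yi, ti in zip(y, t) if ti == 1) / n
    ey0_p0 = sum(yi for yi, ti in zip(y, t) if ti == 0) / n
    lower = a * p0 + ey1_p1 - ey0_p0 - b * p1
    upper = b * p0 + ey1_p1 - ey0_p0 - a * p1
    return lower, upper


def monotone_response_bounds(y, t, a, b):
    """Bounds under Y(0) <= Y(1) as in equation 18: the lower bound is 0."""
    _, upper = no_assumption_bounds(y, t, a, b)
    return 0.0, upper


# Toy binary outcome (hypothetical), so a = 0 and b = 1.
t = [1, 1, 1, 1, 0, 0, 0, 0]
y = [1, 1, 1, 0, 1, 0, 0, 0]

lower, upper = no_assumption_bounds(y, t, a=0, b=1)
print(lower, upper, upper - lower)
print(monotone_response_bounds(y, t, a=0, b=1))
```

Here the no-assumption bounds are $[-0.25, 0.75]$, with width $b - a = 1$ as stated above, and the monotone treatment response assumption tightens them to $[0, 0.75]$.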
In contrast, the second approach abandons the exogeneity assumption entirely and attempts to bound the causal quantities of interest. Here, we consider a third approach, called sensitivity analysis. The idea is to adopt a strategy that lies somewhere between the previous two approaches. We present the basic idea via a class of parametric sensitivity analyses studied by Rosenbaum and Rubin (1983a) and Imbens (2003). Consider an unobserved binary variable, $U_i$, whose marginal distribution is Bernoulli with probability $p = \Pr(U_i = 1)$. Note that we assume independence between $X_i$ and $U_i$. Next, consider the following parametric models,

$$Y_i(T_i) \mid T_i, X_i, U_i \;\overset{\text{indep.}}{\sim}\; N(\tau T_i + X_i^\top \beta + \delta U_i, \; \sigma^2), \tag{25}$$

$$\Pr(T_i = 1 \mid X_i, U_i) = \frac{\exp(X_i^\top \gamma + \alpha U_i)}{1 + \exp(X_i^\top \gamma + \alpha U_i)}, \tag{26}$$

where $X_i$ is the vector of observed pre-treatment covariates. Note that the model assumes unconfoundedness conditional on $U_i$ and $X_i$. A sensitivity analysis is carried out by fixing the sensitivity parameters, $(p, \alpha, \delta)$, and conducting maximum likelihood estimation of $\tau$ to see how it varies with the values of $(p, \alpha, \delta)$.

How do we choose the range of possible values for the sensitivity parameters? It is relatively easy to choose the values of $p$ and $\delta$. In particular, $\delta$ corresponds to the difference in the conditional expectations of the potential outcome, i.e., $\delta = E(Y_i(T_i) \mid T_i, X_i, U_i = 1) - E(Y_i(T_i) \mid T_i, X_i, U_i = 0)$. Choosing the range of possible values for $\alpha$ is not as straightforward, but one way is to consider the odds ratio,

$$\Gamma = \frac{\Pr(T_i = 1 \mid X_i, U_i = 1) / \Pr(T_i = 0 \mid X_i, U_i = 1)}{\Pr(T_i = 1 \mid X_i, U_i = 0) / \Pr(T_i = 0 \mid X_i, U_i = 0)} = \exp(\alpha). \tag{27}$$

Alternatively, Imbens (2003) assumes independence between $U_i$ and $X_i$ and considers a parameterization based on partial $R^2$,

$$
R^2_{Y,\mathrm{par}}(\alpha, \delta, p) = \frac{R^2_Y(\alpha, \delta, p) - R^2_Y(0, 0, p)}{1 - R^2_Y(0, 0, p)}, \qquad
R^2_{T,\mathrm{par}}(\alpha, \delta, p) = \frac{R^2_T(\alpha, \delta, p) - R^2_T(0, 0, p)}{1 - R^2_T(0, 0, p)},
\tag{28}
$$

where $R^2_Y(\alpha, \delta, p) = 1 - \hat\sigma^2(\alpha, \delta, p) / \mathrm{var}(Y_i)$ and $R^2_T(\alpha, \delta, p) = \{\hat\gamma(\alpha, \delta, p)^\top \Sigma_X \hat\gamma(\alpha, \delta, p) + \alpha^2 p (1 - p)\} / \{\hat\gamma(\alpha, \delta, p)^\top \Sigma_X \hat\gamma(\alpha, \delta, p) + \alpha^2 p (1 - p) + \pi^2 / 3\}$.

Of course, the problem with this parametric sensitivity analysis is that it is postulated in the context of a particular parametric model.
In addition, the distribution of $U_i$ as well as its independence from $X_i$ are assumed. Recent methodological work has developed nonparametric sensitivity analyses in order to relax many of these assumptions (e.g., Robins et al., 1999).

5 Connections to Missing Data and Sample Selection Problems

Finally, we point out that causal inference problems are statistically equivalent to missing data problems. As a result, many techniques discussed here have direct connections with those used for correcting the bias due to missing data. To see the connection, consider the estimation of the ATE, which in turn requires the estimation of the marginal means, $\mu_t = E(Y_i(t))$ for $t = 0, 1$. However, for each unit, only one potential outcome is observed. Thus, $Y_i(t)$ is observed for units with $T_i = t$ and is missing for units with $T_i = 1 - t$. This means that causal inference is a missing data problem.

In the literature on missing data, the following three assumptions are commonly considered: MCAR (missing completely at random), MAR (missing at random), and NI (nonignorable). Let $R$ be the recording indicator variable, which is equal to 1 if the data are observed and 0 otherwise. We use $Y_{mis}$ and $Y_{obs}$ to denote the missing and observed data, respectively. Then, MCAR and MAR are characterized as follows,

$$R \perp\!\!\!\perp (Y_{mis}, Y_{obs}) \quad \text{MCAR}, \tag{29}$$

$$R \perp\!\!\!\perp Y_{mis} \mid Y_{obs} \quad \text{MAR}. \tag{30}$$

Thus, MCAR in the missing data problem corresponds to randomized experiments in causal inference. Just as the treatment assignment is randomized in such experiments, MCAR implies that the probability of nonresponse does not depend on the observed or unobserved data. In contrast, MAR in the missing data problem corresponds to the unconfoundedness assumption in causal inference. Under this assumption, the probability of nonresponse does not depend on unobserved data once we condition on observed data. That is, the missing data mechanism depends on the observed data and not on the unobserved data. Finally, NI characterizes the situation that is neither MCAR nor MAR.

Note that the phrase "ignorability" comes from the fact that under MAR (or MCAR), one can ignore the missing data mechanism if one assumes that the parameters $(\theta, \xi)$ of $p(R \mid Y_{obs}, Y_{mis}, \xi)$ and $p(Y_{obs}, Y_{mis} \mid \theta)$ are disjoint,

$$p(R, Y_{obs} \mid \theta, \xi) = p(R \mid Y_{obs}, \xi)\, p(Y_{obs} \mid \theta). \tag{31}$$

Although under MAR one can ignore the modeling of $R$, the doubly-robust estimation methods discussed above suggest that there may be a benefit to modeling the response mechanism in addition to the regression modeling of the outcome variable. Other methods of causal inference are also applicable to missing data problems.
For example, bounds similar to the ones derived above can be obtained for missing data problems (Manski, 1989).

Finally, a method that became prominent in the 1970s is the parametric sample selection model due to Heckman (1979). This type of model can be written as,

$$Y_i = f_1(X_i) + \epsilon_i, \tag{32}$$

$$R_i = 1\{f_2(X_i) + \eta_i > 0\}, \tag{33}$$

where $E(\epsilon_i \mid X_i) = 0$. The model implies,

$$E(Y_i \mid X_i, R_i = 1) = f_1(X_i) + E\{\epsilon_i \mid X_i, f_2(X_i) + \eta_i > 0\}, \tag{34}$$

where the second term is not necessarily zero unless $\epsilon_i$ is conditionally independent of $\eta_i$ given $X_i$. Heckman (1979) assumed that $(\epsilon_i, \eta_i)$ are normally distributed with mean zero and possibly non-zero correlation, but are independent of $X_i$. In addition, $f_1(X_i)$ and $f_2(X_i)$ are assumed to be linear functions. Under these assumptions, the model becomes,

$$E(Y_i \mid X_i, R_i = 1) = X_i^\top \beta_1 + \gamma \frac{\phi(X_i^\top \beta_2)}{\Phi(X_i^\top \beta_2)}, \tag{35}$$

where the ratio in the second term is called the inverse Mills ratio. As you can see, the identification of this model hinges on its functional form unless there are some restrictions on $(\beta_1, \beta_2)$, e.g., exclusion restrictions. Unfortunately, maximum likelihood estimation of this model is difficult, and thus Heckman (1979) proposes to consistently estimate $\beta_1$ with a two-step procedure (estimate $\beta_2$ by maximum likelihood estimation of the probit model and then estimate $\beta_1$ by plugging $\hat\beta_2$ into equation 35). Although this procedure is easy to implement, it does not yield a fully efficient estimator as maximum likelihood estimation does, and the computation of valid standard errors is not straightforward. A more natural way to estimate the model would be to use the EM algorithm, viewing the complete-data model as the bivariate probit model.

Finally, it is important to note that this estimator is not doubly-robust. That is, if either the outcome model or the nonresponse model is incorrect, then the results will be incorrect. Moreover, this sample selection model does not nest the usual outcome regression where we assume,

$$E(\epsilon_i \mid X_i, f_2(X_i) + \eta_i > 0) = E(\epsilon_i \mid X_i). \tag{36}$$

Although under this assumption $\epsilon_i$ is assumed to be uncorrelated with $\eta_i$ given $X_i$, unlike in Heckman's model there is no need to assume that $f_1(X_i)$ and $f_2(X_i)$ are both linear (in fact, $f_2(X_i)$ is completely free of restrictions). It is also unnecessary to assume that $\epsilon_i$ and $\eta_i$ are independent of $X_i$. Nevertheless, applied researchers have the incorrect perception that Heckman's model is more general than the usual regression model.
In fact, subsequent research after this model was proposed has shown that the model is highly sensitive to minor violations of its distributional and functional form assumptions.

6 Final Remarks

The most important distinction between experimental and observational studies is whether researchers know the treatment assignment mechanism. In observational studies they do not, so they are forced to make assumptions about the unknown mechanism. There are many assumptions one can make, and the choice must be made separately in each study: the same assumption may be more reasonable in one study than in another. Although social scientists have relied heavily on regression models for causal inference in observational studies, many methodological studies have shown that these methods are highly sensitive to their functional form, distributional, and other assumptions (e.g., LaLonde, 1986).

Statisticians and methodologists have come up with various ways to address this problem in observational studies, proposing methods that rely on less restrictive assumptions. In particular, the methods of matching and weighting that partially rely on propensity scores have been shown to be much more robust than the sole use of regression models.

Finally, the exogeneity assumption is strong and may not hold in practice. In such cases, the method of bounds may be useful because it clarifies the identifying power of additional assumptions. An alternative approach is to conduct a sensitivity analysis, in which researchers examine how robust their conclusions are to varying degrees of violation of the key assumptions.

References

Acemoglu, D., Johnson, S., and Robinson, J. A. (2001). The colonial origins of comparative development. American Economic Review 91, 5.

Angrist, J. D. (1990). Lifetime earnings and the Vietnam era draft lottery: Evidence from social security administrative records. American Economic Review 80.

Angrist, J. D. and Krueger, A. B. (1991). Does compulsory school attendance affect schooling and earnings? Quarterly Journal of Economics 106.

Ashenfelter, O. and Krueger, A. (1994). Estimates of the economic return to schooling from a new sample of twins. American Economic Review 84, 5.

Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica 47, 1.

Heckman, J. J., Ichimura, H., and Todd, P. (1998). Matching as an econometric evaluation estimator. Review of Economic Studies 65.

Ho, D. E. and Imai, K. (2006). Randomization inference with natural experiments: An analysis of ballot effects in the 2003 California recall election. Journal of the American Statistical Association 101, 475.

Ho, D. E. and Imai, K. (2008). Estimating causal effects of ballot order from a randomized natural experiment: California alphabet lottery, Public Opinion Quarterly 72, 2, Forthcoming.

Ho, D. E., Imai, K., King, G., and Stuart, E. A. (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis 15, 3.

Horvitz, D. and Thompson, D. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47, 260.

Hoxby, C. M. (2000). Does competition among public schools benefit students and taxpayers? American Economic Review 90, 5.

Imai, K., King, G., and Stuart, E. A. (2008). Misunderstandings among experimentalists and observationalists about causal inference. Journal of the Royal Statistical Society, Series A 171, 2.

Imai, K. and van Dyk, D. A. (2004). Causal inference with general treatment regimes: Generalizing the propensity score. Journal of the American Statistical Association 99, 467.

Imbens, G. W. (2003). Sensitivity to exogeneity assumptions in program evaluation. American Economic Review 93, 2.

Imbens, G. W., Rubin, D. B., and Sacerdote, B. (2001). Estimating the effect of unearned income on labor earnings, savings, and consumption: Evidence from a survey of lottery players. American Economic Review 91, 4.

Kang, J. D. and Schafer, J. L. (In-press). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data (with discussion). Statistical Science.

LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review 76, 4.

Lee, D. S., Moretti, E., and Butler, M. J. (2004). Do voters affect or elect policies?: Evidence from the U.S. House. Quarterly Journal of Economics 119, 3.

Manski, C. (1997). Monotone treatment response. Econometrica 65, 6.

Manski, C. F. (1989). Anatomy of the selection problem. Journal of Human Resources 24, 3.

Manski, C. F. (1990). Non-parametric bounds on treatment effects. American Economic Review, Papers and Proceedings 80.

Manski, C. F. (2007). Identification for Prediction and Decision. Harvard University Press, Cambridge, MA.

Miguel, E., Satyanath, S., and Sergenti, E. (2004). Economic shocks and civil conflict: An instrumental variables approach. Journal of Political Economy 112, 4.

Robins, J., Sued, M., Lei-Gomez, Q., and Rotnitzky, A. (In-press). Performance of double-robust estimators when inverse probability weights are highly variable. Statistical Science.

Robins, J. M. (1989). Health Research Methodology: A Focus on AIDS (eds. L. Sechrest, H. Freeman, and A. Mulley), chap. The Analysis of Randomized and Non-randomized AIDS Treatment Trials Using a New Approach to Causal Inference in Longitudinal Studies. NCHSR, U.S. Public Health Service, Washington, D.C.

Robins, J. M., Rotnitzky, A., and Scharfstein, D. (1999). Statistical Models in Epidemiology (E. Halloran and D. Berry, eds.), chap. Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. Springer, New York.

Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89, 427.

Rosenbaum, P. R. and Rubin, D. B. (1983a). Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society, Series B, Methodological 45.

Rosenbaum, P. R. and Rubin, D. B. (1983b). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 1.

Sacerdote, B. (2001). Peer effects with random assignment: Results for Dartmouth roommates. Quarterly Journal of Economics 116, 2.

Tan, Z. (2006). A distributional approach for causal inference using propensity scores. Journal of the American Statistical Association 101, 476.
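
Supplementary sketch for the method of bounds mentioned in the Final Remarks. Assuming a binary treatment and an outcome known to lie in a fixed interval, the no-assumption bounds on the average treatment effect in the spirit of Manski (1990) can be computed roughly as follows. The function name `manski_bounds` and the toy data are hypothetical and serve only as an illustration; this is not code from the lecture notes.

```python
def manski_bounds(y, t, y_min=0.0, y_max=1.0):
    """No-assumption bounds on the ATE for an outcome known to lie in [y_min, y_max].

    y: observed outcomes; t: binary treatment indicators (1 = treated, 0 = control).
    The name and interface are illustrative assumptions.
    """
    treated = [yi for yi, ti in zip(y, t) if ti == 1]
    control = [yi for yi, ti in zip(y, t) if ti == 0]
    if not treated or not control:
        raise ValueError("need at least one treated and one control unit")

    p1 = len(treated) / len(y)           # share of treated units
    p0 = 1.0 - p1                        # share of control units
    m1 = sum(treated) / len(treated)     # mean observed outcome among treated
    m0 = sum(control) / len(control)     # mean observed outcome among controls

    # E[Y(1)] is observed for treated units; for controls it is only known
    # to lie between y_min and y_max. The same logic applies to E[Y(0)].
    ey1_lo, ey1_hi = m1 * p1 + y_min * p0, m1 * p1 + y_max * p0
    ey0_lo, ey0_hi = m0 * p0 + y_min * p1, m0 * p0 + y_max * p1

    return ey1_lo - ey0_hi, ey1_hi - ey0_lo


# Hypothetical toy data: binary outcome, four treated and four control units.
y = [1, 1, 1, 0, 1, 0, 0, 0]
t = [1, 1, 1, 1, 0, 0, 0, 0]
lower, upper = manski_bounds(y, t)
print(lower, upper)
```

Without further assumptions the interval always has width `y_max - y_min` and always contains zero, which is why additional assumptions are needed to narrow it.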
