Causal Regression Models II: Unconfoundedness and Causal Unbiasedness 1


Methods of Psychological Research Online 2000, Vol. 5, No. 3
Institute for Science Education 2000 IPN Kiel

Causal Regression Models II: Unconfoundedness and Causal Unbiasedness 1

By Rolf Steyer 2, Siegfried Gabler 3, Alina A. von Davier 4, and Christof Nachtigall 4

Abstract

We consider regression models with discrete units and a discrete treatment variable. In this framework, individual and average causal effects as well as causal unbiasedness of conditional expected values E(Y X = x) and of their differences were defined in a previous paper, where it was also noted that a hypothesis of causal unbiasedness is not empirically testable outside the randomized experiment. Therefore, we study a stronger causality criterion, which we call unconfoundedness. To our knowledge, this is the weakest empirically testable condition implying causal unbiasedness of the conditional expected values E(Y X = x). Unconfoundedness holds in randomized experiments, but it may hold in nonrandomized experiments as well. We derive theorems about sufficient and necessary conditions, about sufficient conditions, and about necessary conditions for unconfoundedness. The latter identify the hypotheses to be tested in nonrandomized experiments when it comes to testing the weakest empirically testable sufficient condition for conditional expected values E(Y X = x) to be causally unbiased.

Keywords: Causality; Confounding; Regression Models; Randomization; Rubin's Approach to Causality

1 We would like to thank Albrecht Iseler (Free University Berlin) for a discussion in which the idea for some parts of this paper was developed. We would also like to extend our gratitude to Angelika von der Linde (University of Edinburgh), Donald B. Rubin (Harvard University), and Peter Kischka (FSU Jena), who all gave helpful critical comments on previous versions of this paper.
2 Address for correspondence: Prof. Dr. Rolf Steyer, Friedrich-Schiller-Universität Jena, Am Steiger 3 Haus 1, D Jena, Germany. rolf.steyer@uni-jena.de
3 Center for Surveys, Methods, and Analyses (ZUMA), Mannheim, Germany
4 Friedrich-Schiller-University Jena, Germany

In a previous paper (Steyer, Gabler, von Davier, Nachtigall, & Buhl, 2000) we reformulated the theory of individual and average causal effects in terms of classical probability theory and illustrated it by some examples. This theory goes back to Neyman (1923/1990; 1935) and has been adopted and enriched by Rubin (1973a, b, 1974, 1977, 1978, 1985, 1986, 1990), Holland and Rubin (1983), Holland (1986, 1988a, b), Rosenbaum and Rubin (1983a, b, 1984, 1985a, b), Rosenbaum (1984a, b, c), and Sobel (1994, 1995), for instance. More specifically, we introduced a probabilistic framework: a potential causal regression model with discrete units and a discrete treatment variable,5 which consists of three components: a probability space (Ω, A, P), which represents the random experiment, i.e., the empirical phenomenon considered; a regression E(Y X), whose potential causal interpretation is at issue, where X represents a discrete (but not necessarily univariate) treatment variable and Y a real-valued (and not necessarily continuous) response variable; and a nonnumeric random variable U, the value of which is the observational unit drawn. The probability space is chosen such that X, Y, and U are random variables on (Ω, A, P), i.e., X, Y, and U have a joint distribution. In this framework and notation, we presented the basic concepts of the theory of individual and average causal effects: the individual conditional expected values E(Y X = x, U = u), the individual causal effects ICE_u(i, j) := E(Y X = x_i, U = u) − E(Y X = x_j, U = u) of a treatment x_i vs. a treatment x_j, the causally unbiased expected value [denoted CUE(Y X = x)], and the average causal effect ACE(i, j) (see Figure 1 of Steyer et al., 2000). We also showed how these concepts are related to the conditional expected values E(Y X = x) and the prima facie effects PFE(i, j) = E(Y X = x_i) − E(Y X = x_j) that are usually the focus of statistical estimation and hypothesis testing.
Specifically, it has been shown that both (a) stochastic independence of U and X (which can be created via random assignment of units to treatment conditions) and (b) unit-treatment homogeneity [i.e., E(Y X, U) = E(Y X)] imply the equations E(Y X = x) = CUE(Y X = x) and PFE(i, j) = ACE(i, j) [i.e., causal unbiasedness of E(Y X = x) and of PFE(i, j)]. We argued that the theory of individual and average causal effects reviewed above has some major limitations. The first one is that a proposition about causal unbiasedness is not empirically falsifiable: postulating that the prima facie effect PFE(i, j) is equal to the average causal effect ACE(i, j), or that the conditional expected values E(Y X = x) are equal to the causally unbiased expected values CUE(Y X = x) of Y given x, does not imply anything one could show to be wrong in an empirical application. The reason is that the computation of ACE(i, j) and CUE(Y X = x) involves all individual conditional expected values E(Y X = x, U = u) in the population and that it is not possible to estimate all of them (see the fundamental problem of causal inference in Holland, 1986, or Steyer et al., 2000). The second limitation is that even if both PFE(i, j) = ACE(i, j) and E(Y X = x) = CUE(Y X = x) hold in the total population, the corresponding equations may not hold in the subpopulations, for instance, in the subpopulations of males and females. Hence, causal unbiasedness of the prima facie effects might fail in the subpopulations even though it holds in the total population. Because of this nonfalsifiability we argued that, although the theory of individual and average causal effects provides indispensable concepts, it is not complete as a methodological basis for causal modeling outside the randomized experiment, i.e., when treatment assignment may not be random. Completing this methodological basis is the central goal of this paper. Specifically, we aim at enriching the theory of individual and average causal effects by introducing the concept of unconfoundedness of a regression E(Y X). Unconfoundedness will be defined such that (a) the proposition that a regression E(Y X) is unconfounded is empirically falsifiable, and (b) unconfoundedness of E(Y X) implies that the regressions E_W=w(Y X) in the subpopulations are causally unbiased as well.6 Among all criteria fulfilling these requirements we prefer the criterion which is logically the weakest, i.e., which is implied by the others but does not imply any of them. If requirement (a) were not fulfilled, anybody could make causal propositions without anyone having a chance to falsify them.

5 For a more general formal framework, see Steyer (1992).
Requirement (b) means that causal interpretations such as causal unbiasedness should be transferable from the total population into its subpopulations and in this sense be stable or invariant. Hence, the goal of this paper is to introduce a causality criterion that fulfills the requirements (a) and (b) mentioned above, compare it to some other criteria, illustrate it by an example, and study its sufficient and necessary conditions. More specifically, we will present and discuss different causality criteria in section 1, including strong ignorability and unconfoundedness. The latter is, according to requirements (a) and (b), the most favorable one. In section 2 we will illustrate the different criteria by an example. Section 3 is devoted to a necessary and sufficient condition of unconfoundedness. Next we will study sufficient conditions of unconfoundedness (section 4), and then sufficient conditions of strong ignorability (section 5). Then we turn to the necessary conditions of unconfoundedness (section 6). One of these necessary conditions is causal unbiasedness of E(Y X). Section 7 focuses on two general procedures to falsify unconfoundedness, which will be illustrated by an example in section 8. Section 9 deals with necessary conditions for unconfoundedness under special assumptions such as linearity and W-conditional linearity of E(Y X, W), where W is a variable each value of which represents a subpopulation. Finally, the merits and limits of the theory presented are discussed in the concluding section 10.

6 Each value w of W represents a subpopulation.

1. Some Causality Criteria

The weakest criterion we consider is causal unbiasedness of the treatment regression E(Y X) in the total population, i.e.,

E(Y X = x) = CUE(Y X = x), for each value x of X, (1)

with

CUE(Y X = x) := Σ_u E(Y X = x, U = u) P(U = u), (2)

where the summation is across all observational units u in the (total) population Ω_U (i.e., the set of all units considered). This term has been defined to be the causally unbiased expected value of Y given x (Steyer et al., 2000). Equation (1) means that each E(Y X = x) is equal to the expected value of the individual expected values E(Y X = x, U = u) across the distribution of the observational units. Equation (1) is sufficient to imply PFE(i, j) = ACE(i, j), i.e., causal unbiasedness of the prima facie effects in the total population (see Steyer et al., 2000). The equation E(Y X = x) = CUE(Y X = x) defines causal unbiasedness of a conditional expected value E(Y X = x). From the perspective of the theory of individual and average causal effects, there is no doubt that Equation (1), and with it the concept of causal unbiasedness, is a desirable and indispensable property of the conditional expected values E(Y X = x) in meaningful substantive applications.
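Equation (2) is just a weighted average of the individual conditional expected values. The following sketch computes it for a tiny hypothetical population of two units; all numbers are invented for illustration.

```python
# Hypothetical illustration of Equations (1) and (2); all numbers are invented.
# For each unit u we store P(U = u) and the individual conditional expected
# values E(Y | X = x, U = u); CUE(Y | X = x) is their P(U = u)-weighted average.

p_u = {"u1": 0.5, "u2": 0.5}                      # P(U = u)
e_y = {                                            # E(Y | X = x, U = u)
    ("x1", "u1"): 100.0, ("x1", "u2"): 130.0,
    ("x2", "u1"): 90.0,  ("x2", "u2"): 110.0,
}

def cue(x):
    """Causally unbiased expected value CUE(Y | X = x), Equation (2)."""
    return sum(e_y[(x, u)] * p_u[u] for u in p_u)

print(cue("x1"))  # 115.0
print(cue("x2"))  # 100.0
```

Note that nothing in this computation involves the observed conditional expected values E(Y X = x); this is exactly why a hypothesis of causal unbiasedness cannot be checked from observable quantities alone.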
However, as mentioned before, Equation (1) neither implies anything that could be falsified in an empirical application [see requirement (a)] nor does it imply unbiasedness of the regressions E_W=w(Y X) in the subpopulations represented by W = w [see requirement (b)]. Hence, Equation (1) [causal unbiasedness of the regression E(Y X)] is too weak to qualify as a satisfactory causality criterion applicable to experimental studies with nonrandom assignment.7 The same argument applies, of course, to the criterion PFE(i, j) = ACE(i, j), causal unbiasedness of the prima facie effects. Both criteria lack empirical falsifiability.

7 In a similar vein, Pearl (1998) argues that unbiasedness might be incidental, which creates the need for stable unbiasedness (see also Example III in Steyer et al., 2000).

As a second criterion we consider strong ignorability, described, e.g., by Rosenbaum and Rubin (1983a). They defined potential response variables Y_i : Ω → IR for each treatment condition x_i. In our notation we could define these variables for each value x_i by Y_i(ω) = E(Y X = x_i, U = u), for all ω with U(ω) = u. Defined in this way, it is obvious that the variables Y_i, i = 1, …, n_x, are functions of U and therefore have a joint distribution with X, because U and X have a joint distribution.8 Hence, stochastic independence of U and X implies that the variables Y_1, …, Y_n_x and X are stochastically independent (denoted Y_1, …, Y_n_x ⊥ X). The condition (a) Y_1, …, Y_n_x ⊥ X and (b) 0 < P(X = x_i) < 1, for each value x_i of X, has been called strong ignorability (see, e.g., Rosenbaum & Rubin, 1983a, p. 213).9 Strong ignorability implies causal unbiasedness of the conditional expected values and the prima facie effects. However, it is not empirically testable, and it is unknown whether or not it implies causal unbiasedness in the subpopulations (see Note 1 in Appendix A). Let us now discuss several other sufficient conditions for causal unbiasedness that are empirically falsifiable. A third criterion we might consider is stochastic independence of X and U. This criterion, in fact, fulfills both requirements discussed above. It is empirically falsifiable and it implies causal unbiasedness of the conditional expected values E(Y X = x, W = w) in the subpopulations represented by W = w (see Th. 3 and Th. 6). However, Theorem 2 of Steyer et al.
(2000) shows that there is a fourth criterion that is also falsifiable and implies causal unbiasedness of the conditional expected values E(Y X = x, W = w): unit-treatment homogeneity, i.e., E(Y X, U) = E(Y X). Hence, independence of U and X would be unnecessarily strong. The fifth criterion to be considered is: X and U are independent or E(Y X, U) = E(Y X).10 This criterion would also fulfill both requirements discussed above. In fact, it is already very close to our favorite one, which is somewhat weaker and still fulfills the two requirements (a) and (b). The sixth criterion will be called unconfoundedness of the regression E(Y X).11 It is defined as follows. Note that throughout the paper we presume that ⟨(Ω, A, P), E(Y X), U⟩ is a potential causal regression model.

8 In Rubin's notation, the causally unbiased expected values would be denoted E(Y_i), i.e., E(Y_i) = CUE(Y X = x_i).
9 For simplicity, we restrict our discussion to the case where there is no concomitant variable or covariate. Rosenbaum and Rubin consider the case with a covariate. Also note that strong ignorability implies E(Y_i) = E(Y_i X = x) for each pair (i, x) of indices i ∈ {1, …, n_x} and values x of X.
10 This criterion is very close to what has been proposed by Pearl (1998) in his definitions 2 and 3.
11 In previous papers (see, e.g., Steyer, Gabler, & Rucai, 1996, and Steyer, von Davier, Gabler, & Schuster, 1997), we have defined unconfoundedness in a different but equivalent way (see Th. 2).

Definition 1. E(Y X) is called unconfounded if for each value x of X:

P(X = x U = u) = P(X = x) for each value u of U (3)

or

E(Y X = x, U = u) = E(Y X = x) for each value u of U. (4)

This sixth criterion differs from the fifth one by the fact that Equation (3) or Equation (4) is postulated within each value x of X. (For an example, see Table 2.) Hence, within each treatment condition we require equal treatment assignment probabilities for each unit u [see Eq. (3)] or homogeneity of the units [see Eq. (4)]. Only one of these equations has to be true within each treatment condition x. In contrast, the fifth criterion requires Equation (3) to be true for all treatment conditions x or Equation (4) to be true for all treatment conditions x. Whereas from a substantive point of view this difference seems negligible, it is important from a mathematical point of view: the two criteria are not equivalent to each other. In fact, the fifth criterion implies the sixth criterion, but not vice versa. Table 1 displays a summary of the causality criteria discussed above.

Table 1. Summary of the causality criteria discussed in the paper
1. Causal unbiasedness of E(Y X): E(Y X = x) = Σ_u E(Y X = x, U = u) P(U = u), for each value x of X
2. Strong ignorability: Y_1, …, Y_n_x ⊥ X and 0 < P(X = x_i) < 1
3. Independence of X and U: X ⊥ U
4. Unit-treatment homogeneity: E(Y X, U) = E(Y X)
5. Independence of X and U or unit-treatment homogeneity: X ⊥ U or E(Y X, U) = E(Y X)
6. Unconfoundedness of E(Y X): for each value x of X, P(X = x U = u) = P(X = x) for each value u of U, or E(Y X = x, U = u) = E(Y X = x) for each value u of U

2. An Example

Table 2 gives an example illustrating the criteria discussed above. The example is constructed in such a way that unconfoundedness and causal unbiasedness hold but none of the other causality criteria do. We consider three treatment conditions x_1 to x_3 and a

population of eight units. We assume that each unit has the probability P(U = u) = 1/8 of being sampled. Let us first look at treatment x_1. Column 4 displays the individual assignment probabilities for treatment x_1. They are all the same, namely 1/2, for each individual unit. For this specific treatment condition x_1 we have different individual conditional expected values E(Y X = x_1, U = u) (see column 5). For treatments x_2 and x_3 things are different. Column 6 contains different treatment assignment probabilities P(X = x_2 U = u) for treatment condition x_2, and column 7 displays equal individual conditional expected values E(Y X = x_2, U = u). The same is true for treatment condition x_3: again, we have different treatment assignment probabilities P(X = x_3 U = u) but equal individual conditional expected values E(Y X = x_3, U = u). In order to check causal unbiasedness, we may first compute the three conditional expected values E(Y X = x_1) = 115, E(Y X = x_2) = 105, and E(Y X = x_3) = 110 from the equation

E(Y X = x) = Σ_u E(Y X = x, U = u) P(U = u X = x), (5)

which always holds for a discrete random variable U. Using

P(U = u X = x) = P(X = x U = u) P(U = u) / P(X = x) (6)

yields P(U = u X = x_1) = (1/2) (1/8) / (1/2) = 1/8 for each unit u and E(Y X = x_1) = 82 · 1/8 + 89 · 1/8 + … = 115. For the second treatment condition x_2 the corresponding formulas lead to P(U = u X = x_2) = (1/10) (1/8) / (1/4) = 1/20 for u_1 and u_2, to P(U = u X = x_2) = (2/10) (1/8) / (1/4) = 2/20 for u_3 and u_4, to P(U = u X = x_2) = (3/10) (1/8) / (1/4) = 3/20 for u_5 and u_6, and to P(U = u X = x_2) = (4/10) (1/8) / (1/4) = 4/20 for u_7 and u_8. Hence, Equation (5) now yields E(Y X = x_2) = 105 · (1/20 + 1/20 + 2/20 + 2/20 + 3/20 + 3/20 + 4/20 + 4/20) = 105. Analogously, we can compute the conditional expected value E(Y X = x_3) = 110.

Table 2. An example in which the treatment regression E(Y X) is unconfounded and causally unbiased but none of the other causality criteria discussed in this paper hold

u | P(U = u) | W (gender) | P(X = x_1 U = u) | E(Y X = x_1, U = u) | P(X = x_2 U = u) | E(Y X = x_2, U = u) | P(X = x_3 U = u) | E(Y X = x_3, U = u)
u_1 | 1/8 | m | 1/2 | 82 | 1/10 | 105 | 4/10 | 110
u_2 | 1/8 | m | 1/2 | 89 | 1/10 | 105 | 4/10 | 110
u_3 | 1/8 | m | 1/2 | … | 2/10 | 105 | 3/10 | 110
u_4 | 1/8 | m | 1/2 | … | 2/10 | 105 | 3/10 | 110
u_5 | 1/8 | f | 1/2 | … | 3/10 | 105 | 2/10 | 110
u_6 | 1/8 | f | 1/2 | … | 3/10 | 105 | 2/10 | 110
u_7 | 1/8 | f | 1/2 | … | 4/10 | 105 | 1/10 | 110
u_8 | 1/8 | f | 1/2 | … | 4/10 | 105 | 1/10 | 110

Note: The (unconditional) probabilities for the three treatments are P(X = x_1) = 1/2, P(X = x_2) = P(X = x_3) = 1/4.

In order to check causal unbiasedness, the conditional expected values computed above have to be compared to the causally unbiased conditional expected values, which may be computed by Equation (2). For the treatment condition x_1 this equation yields

CUE(Y X = x_1) := Σ_u E(Y X = x_1, U = u) P(U = u) = (82 + 89 + …) · 1/8 = 115,

for treatment condition x_2

CUE(Y X = x_2) := Σ_u E(Y X = x_2, U = u) P(U = u) = (105 + … + 105) · 1/8 = 105,

and for treatment condition x_3

CUE(Y X = x_3) := Σ_u E(Y X = x_3, U = u) P(U = u) = (110 + … + 110) · 1/8 = 110.

Hence, in this example, the conditional expected values are unbiased, i.e., E(Y X = x) = CUE(Y X = x) for each value x of X, and the prima facie effects PFE(i, j) = E(Y X = x_i) − E(Y X = x_j) are unbiased as well, i.e., PFE(i, j) = ACE(i, j), for each pair of treatment conditions x_i and x_j.
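The computation via Equations (5) and (6) can be sketched numerically. Since the individual expected values under x_1 for u_3 to u_8 are not fully given in the table, the x_1 column in this sketch is hypothetical except for the values 82 and 89; the assignment probabilities for x_2 are the ones stated in the text.

```python
from fractions import Fraction as F

# Sketch of the Table 2 computation. P(U = u) = 1/8 throughout; the assignment
# probabilities P(X = x2 | U = u) are those given in the text; the individual
# expected values under x2 are all 105, as in Table 2.
p_u = F(1, 8)
p_x2_given_u = [F(1, 10), F(1, 10), F(2, 10), F(2, 10),
                F(3, 10), F(3, 10), F(4, 10), F(4, 10)]
e_y_x2 = [105] * 8                                 # E(Y | X = x2, U = u)

p_x2 = sum(p * p_u for p in p_x2_given_u)          # P(X = x2) = 1/4
# Equation (6): P(U = u | X = x2) = P(X = x2 | U = u) P(U = u) / P(X = x2)
p_u_given_x2 = [p * p_u / p_x2 for p in p_x2_given_u]
# Equation (5): E(Y | X = x2) = sum_u E(Y | X = x2, U = u) P(U = u | X = x2)
e_y_given_x2 = sum(e * w for e, w in zip(e_y_x2, p_u_given_x2))

print(p_x2)          # 1/4
print(e_y_given_x2)  # 105
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the weights 1/20, 2/20, 3/20, 4/20 come out exactly as in the text.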

We now check the other causality criteria discussed in the last section. The second causality criterion, strong ignorability, does not hold in this example. The independence condition Y_1, …, Y_n_x ⊥ X implies that the conditional expected values E(Y_1 X = x_1) and E(Y_1 X = x_2) are identical. However, in our example:

E(Y_1 X = x_1) = Σ_u Y_1(u) P[Y_1 = Y_1(u) X = x_1] = Σ_u Y_1(u) P(U = u X = x_1) = Σ_u Y_1(u) [P(X = x_1 U = u) P(U = u) / P(X = x_1)] = (82 + 89 + …) · 1/8 = 115

and

E(Y_1 X = x_2) = Σ_u Y_1(u) P[Y_1 = Y_1(u) X = x_2] = Σ_u Y_1(u) P(U = u X = x_2) = Σ_u Y_1(u) [P(X = x_2 U = u) P(U = u) / P(X = x_2)] = (82 + 89) · 1/20 + (…) · 2/20 + (…) · 3/20 + (…) · 4/20 = 125.

Hence, we can conclude that strong ignorability does not hold in this example. The third criterion, stochastic independence of X and U, does not hold because, unlike for treatment condition x_1, the individual assignment probabilities P(X = x_2 U = u) and P(X = x_3 U = u) are not the same for each observational unit u. The fourth causality criterion, unit-treatment homogeneity, does not hold either. Although the individual conditional expected values E(Y X = x, U = u) are equal within treatment conditions x_2 and x_3, they differ in treatment condition x_1. The fifth causality criterion, stochastic independence of X and U or unit-treatment homogeneity, does not hold either. However, the sixth causality criterion, unconfoundedness, does hold: in each treatment condition x we have either equal assignment probabilities P(X = x U = u) (for x_1) or equal individual conditional expected values E(Y X = x, U = u) (for x_2 and x_3).

3. Some Equivalent Formulations of Unconfoundedness

We will now study several conditions which are equivalent to unconfoundedness as defined above. These conditions will provide the theoretical basis for empirical falsifiability.
All these conditions involve random variables W for which there exists a mapping f such that W = f(U) is the composition of U with f. As can easily be seen, such a variable W partitions the population Ω_U into subpopulations (such as the set of male and the set of female persons). Presuming a potential causal regression model ⟨(Ω, A, P), E(Y X), U⟩ and using these variables, we may formulate the following theorem.12

Theorem 1. E(Y X) is unconfounded if and only if for each value x of X and for each W = f(U):

P(X = x W = w) = P(X = x) for each value w of W (7)

or

E(Y X = x, W = w) = E(Y X = x) for each value w of W. (8)

According to this theorem, unconfoundedness postulates, within each treatment condition x, equal treatment probabilities for each subpopulation w or equal conditional expected values across all subpopulations (represented by the different values w of W). Note that, in empirical applications, this theorem already provides a basis for falsification of unconfoundedness, because, given a specific W = f(U), each term in these two equations is empirically estimable. At least one of the two equations above has to be true for each W = f(U) and a given value x of X. If neither Equation (7) nor Equation (8) holds for a given W = f(U), we can conclude that E(Y X) is confounded. The next theorem will show the surprising fact that unconfoundedness as defined by the sixth criterion (see Def. 1) is equivalent to unconfoundedness as defined by Steyer, Gabler, and Rucai (1996).13

Theorem 2. E(Y X) is unconfounded if and only if for each W = f(U):

E(Y X = x) = Σ_w E(Y X = x, W = w) P(W = w), for each value x of X. (9)

Note that the summation is across all values w of W. According to this theorem, postulating Equation (9) for each variable W = f(U) is equivalent to (general) unconfoundedness of the regression E(Y X). Also note that Equation (9) is not equivalent to Equation (7) or Equation (8). Instead, Theorem 2 formulates (via Th.
1) the equivalence between Equation (9) for each W = f(U) and Equation (7) or Equation (8) for each W = f(U). According to Equation (9), unconfoundedness implies that each conditional expected value E(Y X = x) is the average of the corresponding expected values E(Y X = x, W = w) in the subpopulations represented by the values w of W.

12 For proofs see Appendix B.
13 This formulation of unconfoundedness has also been called weak causality by Steyer (1992). There one will also find a formulation which is not restricted to discrete treatments and a discrete unit variable.

Corollary 1. If the treatment regression E(Y X) is unconfounded, then for each W = f(U):

PFE(i, j) = Σ_w [E(Y X = x_i, W = w) − E(Y X = x_j, W = w)] P(W = w). (10)

According to this corollary, unconfoundedness implies that each prima facie effect, i.e., each difference E(Y X = x_i) − E(Y X = x_j) between two values of the regression E(Y X), is the average of the differences E(Y X = x_i, W = w) − E(Y X = x_j, W = w) across all subpopulations represented by different values w of W. Also note that Theorem 2 holds irrespective of any specific parameterization of the regression E(Y X), be it linear or not. For instance, if Y is dichotomous with values 0 and 1 and X is numerical, Theorem 2 also holds for a logistic regression E(Y X) = P(Y = 1 X) = exp(γ_0 + γ_1 X) / [1 + exp(γ_0 + γ_1 X)]. One might expect that Equation (9) is always true if W is discrete. However, Example Ia of Steyer et al. (2000) (with W = U) shows that this expectation is wrong. An equation for the conditional expected values E(Y X = x) that is always true if W is discrete is:

E(Y X = x) = Σ_w E(Y X = x, W = w) P(W = w X = x), for each value x of X. (11)

Note that Equation (9) also provides a basis for empirical falsification of unconfoundedness, because each term in this equation is empirically estimable. Hence, both Theorems 1 and 2 provide possibilities for falsifying unconfoundedness. It will be useful to supplement the general concept of unconfoundedness by the concept of unconfoundedness with respect to a specific variable W. The definition will be such that general unconfoundedness is equivalent to unconfoundedness with respect to each W = f(U).

Definition 2. E(Y X) is called unconfounded with respect to W = f(U) if Equation (9) holds for each value x of X.

What has been achieved so far?
We have three different but equivalent ways of formulating unconfoundedness (see Table 3). This gives us a deeper understanding of this concept. Specifically, we have shown that, in contrast to causal unbiasedness and strong ignorability, unconfoundedness is empirically falsifiable (see versions 2 and 3 in Table 3).
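The contrast between Equation (11), which always holds, and Equation (9), which holds only under unconfoundedness with respect to W, can be made concrete. The following sketch uses invented numbers for a confounded situation: the joint distribution of X and W and the values E(Y X = x, W = w) are hypothetical.

```python
from fractions import Fraction as F

# Sketch (all numbers invented) of the falsification check licensed by
# Theorem 2: compare E(Y | X = x), computed via Equation (11), which always
# holds, with the Equation (9) expression sum_w E(Y | X = x, W = w) P(W = w).
# A discrepancy falsifies unconfoundedness of E(Y | X) with respect to W.
p_xw = {("x1", "m"): F(4, 10), ("x1", "f"): F(1, 10),   # joint P(X = x, W = w)
        ("x2", "m"): F(1, 10), ("x2", "f"): F(4, 10)}
e_y  = {("x1", "m"): 100, ("x1", "f"): 120,             # E(Y | X = x, W = w)
        ("x2", "m"): 100, ("x2", "f"): 120}
xs, ws = ("x1", "x2"), ("m", "f")

def p_w(w): return sum(p_xw[(x, w)] for x in xs)
def p_x(x): return sum(p_xw[(x, w)] for w in ws)

def eq11(x):   # Equation (11): always true for a discrete W
    return sum(e_y[(x, w)] * p_xw[(x, w)] / p_x(x) for w in ws)

def eq9(x):    # Equation (9): holds iff E(Y | X) is unconfounded w.r.t. W
    return sum(e_y[(x, w)] * p_w(w) for w in ws)

for x in xs:
    print(x, eq11(x), eq9(x))   # x1: 104 vs 110 -> Eq. (9) fails
```

In this invented example there is no X-effect within either subpopulation, yet the two sides of Equation (9) differ, so E(Y X) would be judged confounded with respect to W.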

Table 3. Three equivalent formulations of unconfoundedness
1. For each value x of X: P(X = x U = u) = P(X = x) for each value u of U, or E(Y X = x, U = u) = E(Y X = x) for each value u of U
2. For each variable W = f(U) and for each value x of X: P(X = x W = w) = P(X = x) for each value w of W, or E(Y X = x, W = w) = E(Y X = x) for each value w of W
3. For each variable W = f(U) and for each value x of X: E(Y X = x) = Σ_w E(Y X = x, W = w) P(W = w)

4. Sufficient Conditions for Unconfoundedness

Contrasting unconfoundedness with other criteria, we will now show more formally that unconfoundedness is a comparatively weak sufficient condition for causal unbiasedness. This will be achieved by studying some sufficient conditions for a treatment regression E(Y X) to be unconfounded. The sufficient conditions will also help to understand the concept and how it is related to the experiment with random assignment of units to experimental conditions.

Theorem 3. Each of the following conditions is sufficient for unconfoundedness of the regression E(Y X):
(i) U and X are stochastically independent;
(ii) (Unit-treatment homogeneity) E(Y X, U) = E(Y X);
(iii) (Strong causality) For each W = f(U) there exists a function h such that

E(Y X, W) = E(Y X) + h(W). (12)

Note that each of these conditions is sufficient for unconfoundedness of E(Y X) but not necessary, i.e., each of them implies unconfoundedness of E(Y X), but unconfoundedness of E(Y X) implies neither (i), nor (ii), nor (iii). Condition (i), independence of X and U, is important for understanding the role of randomization, i.e., random assignment of units to treatment conditions. Randomization is the only way for the experimenter to deliberately create one of the sufficient conditions of unconfoundedness.
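The force of the additive decomposition in condition (iii) can be seen in a minimal numerical sketch: with invented functions g and h, the difference between two treatment conditions is the same in every subpopulation, because the h(w) term cancels.

```python
# Sketch (invented numbers) of Theorem 3 (iii): if E(Y | X, W) decomposes
# additively as g(X) + h(W), the effect of x1 vs. x2 is the same in every
# subpopulation w, i.e. the h(w) term cancels in the difference.
g = {"x1": 10, "x2": 4}           # hypothetical treatment part g(X)
h = {"w1": 0, "w2": 7, "w3": -3}  # hypothetical subpopulation part h(W)

def e_y(x, w):                    # E(Y | X = x, W = w) = g(x) + h(w)
    return g[x] + h[w]

effects = {w: e_y("x1", w) - e_y("x2", w) for w in h}
print(effects)  # {'w1': 6, 'w2': 6, 'w3': 6}
```

The constant difference 6 = g(x_1) − g(x_2) is exactly the cancellation exploited in the discussion of strong causality below.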

Equation (12) postulates the additive decomposability of the regression E(Y X, W) into a function g(X) and a function h(W), i.e.,

E(Y X, W) = g(X) + h(W), (13)

with the additional requirement g(X) = E(Y X), which, if Equation (13) holds, is equivalent to E[h(W) X] = E[h(W)]. For W = U, Equation (13) may be called unit-treatment additivity. It postulates the additive decomposability of the regression E(Y X, U) into a function g(X) and a function h(U), with the additional requirement g(X) = E(Y X). Equation (13) does not allow for interactions between X and W in the sense of analysis of variance (or moderator effects in terms of moderator regression models). Strong causality as defined by condition (iii) requires invariance of the individual causal effects across all observational units, because, for W = U, Equation (13) implies

E(Y X = x_1, U = u) − E(Y X = x_2, U = u) = g(x_1) + h(u) − [g(x_2) + h(u)] = g(x_1) − g(x_2). (14)

It should be noted that there is no sufficient condition for such an invariance of individual causal effects that could deliberately be created by the experimenter. Specifically, Equation (13) is not necessarily (and in fact seldom) true in the randomized experiment, in contrast to unconfoundedness, which always holds if there is random assignment of units to treatment conditions. Obviously, unit-treatment homogeneity is a special case of unit-treatment additivity with h(W) = 0.

5. Sufficient Conditions for Strong Ignorability

In this section we state some sufficient conditions for strong ignorability. According to our next theorem, the first two sufficient conditions of unconfoundedness mentioned in Theorem 3 are also sufficient for strong ignorability.

Theorem 4.
Each of the following conditions is sufficient for strong ignorability:
(i) U and X are stochastically independent;
(ii) (Unit-treatment homogeneity) E(Y X, U) = E(Y X).

Note that each of these conditions is sufficient for strong ignorability but not necessary, i.e., each of them implies strong ignorability, but strong ignorability implies neither (i) nor (ii).

6. Necessary Conditions of Unconfoundedness

The first necessary condition formulated in this section relates unconfoundedness to causal unbiasedness (Th. 5). Next we show that unconfoundedness and, therefore, causal unbiasedness can be transferred to each subpopulation. Other necessary conditions treated are important for causal modeling in nonrandomized experiments, because they are the logical basis for falsifying a hypothesis of unconfoundedness.

Theorem 5. If the treatment regression E(Y X) is unconfounded, then each of the following propositions holds:
(i) E(Y X) is causally unbiased, i.e., E(Y X = x) = CUE(Y X = x) for each value x of X;
(ii) the prima facie effects are causally unbiased, i.e., PFE(i, j) = ACE(i, j) for all pairs (x_i, x_j) of values of X.

We now turn to a very powerful theorem according to which unconfoundedness of a regression E(Y X) allows one to conclude that unconfoundedness and, therefore, causal unbiasedness (see Th. 5) can be transferred to each subpopulation. This is not only important for the interpretation of expected values and their differences in subpopulations, but also allows one to conclude that all properties of an unconfounded regression also hold for the corresponding regressions within subpopulations.

Theorem 6. If E(Y X) is unconfounded and W = f(U), then, for each value w of W, the regression E_W=w(Y X) of Y on X in the subpopulation represented by W = w is unconfounded as well.

As mentioned before, this theorem implies that all properties of an unconfounded regression also apply to the regressions E_W=w(Y X) within the subpopulations. Instead of rewriting all propositions for the subpopulations, we just consider the property of causal unbiasedness of the regressions E_W=w(Y X) of Y on X in the subpopulations represented by W = w.

Corollary 2.
If E(Y X) is unconfounded and W = f(U), then the regressions E_W=w(Y X) of Y on X in the subpopulations represented by W = w are causally unbiased, i.e.,

E(Y X = x, W = w) = Σ_u E(Y X = x, U = u, W = w) P(U = u W = w), (15)

for each pair (x, w) of values of X and W.14

14 Note that E(Y X = x, U = u) = E(Y X = x, U = u, W = w), because we presuppose that W is a measurable function of U (see the definitions of U in Steyer et al., 2000 and of W above).
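Theorem 6 can be illustrated with data structured like Table 2, restricted to the male subpopulation W = m (units u_1 to u_4). The unit-level expected values under x_1 below are hypothetical where the table does not determine them; the check applies Definition 1 condition by condition within the subpopulation.

```python
from fractions import Fraction as F

# Sketch of Theorem 6 on Table-2-like data (the unit-level values under x1 are
# partly invented, since the table does not fully determine them). Within the
# subpopulation W = m (units u1..u4) we check Definition 1 condition by
# condition: equal assignment probabilities OR equal expected values.
p_x_given_u = {                                    # P(X = x | U = u), u1..u4
    "x1": [F(1, 2)] * 4,
    "x2": [F(1, 10), F(1, 10), F(2, 10), F(2, 10)],
    "x3": [F(4, 10), F(4, 10), F(3, 10), F(3, 10)],
}
e_y = {                                            # E(Y | X = x, U = u)
    "x1": [82, 89, 100, 110],                      # u3, u4 hypothetical
    "x2": [105] * 4,
    "x3": [110] * 4,
}

def unconfounded_in_subpopulation():
    ok = {}
    for x in p_x_given_u:
        equal_probs  = len(set(p_x_given_u[x])) == 1   # Eq. (3) within W = m
        equal_values = len(set(e_y[x])) == 1           # Eq. (4) within W = m
        ok[x] = equal_probs or equal_values
    return ok

print(unconfounded_in_subpopulation())  # {'x1': True, 'x2': True, 'x3': True}
```

As Theorem 6 predicts, the same either-or pattern that makes E(Y X) unconfounded in the total population survives the restriction to the subpopulation.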

Steyer, Gabler, von Davier & Nachtigall: Causal Regression Models II

Theorem 6 and Corollary 2 show that unconfoundedness is a stable property in the sense that it carries over from the total population to all subpopulations. In contrast, causal unbiasedness may be incidentally true in the total population but fail in any subpopulation (see Example III of Steyer et al., 2000).

Table 4. Necessary conditions of unconfoundedness

1. The regressions E_{W=w}(Y | X) in each subpopulation W = w are unconfounded and, therefore, causally unbiased.

2. The prima facie effects PFE(i, j) in the total population as well as the prima facie effects PFE_{W=w}(i, j) in each subpopulation are causally unbiased.

3. For each variable W = f(U) and for each value x of X:
   P(X = x | W = w) = P(X = x) for each value w of W, or
   E(Y | X = x, W = w) = E(Y | X = x) for each value w of W.

4. For each variable W = f(U) and for each value x of X:
   E(Y | X = x) = Σ_w E(Y | X = x, W = w) P(W = w).

Table 4 gives a summary of the most important necessary conditions of unconfoundedness. The first two necessary conditions are important for substantive interpretations, because they add meaning to the concept of unconfoundedness, which is otherwise not easily seen. The second two necessary conditions are also sufficient. Conditions 1, 3, and 4 may be used in falsification trials of the hypothesis of unconfoundedness. It should be noticed that, for a given variable W, Equation (7) or (8) is stronger than Equation (9), i.e., Equation (7) or (8) implies Equation (9), but not vice versa.

7. General Procedures to Falsify Unconfoundedness

An important goal of this paper is to present a theory that may serve as a methodological foundation of causal modeling also outside the randomized experiment.
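Condition 3 of Table 4 can be checked mechanically once the joint distribution of (X, W) and the cell means E(Y | X = x, W = w) are known or estimated. The sketch below, with invented probabilities and cell means, tests for each value x of X whether at least one of the two clauses holds.

```python
# Hypothetical discrete joint distribution P(X = x, W = w) and cell means
# E(Y | X = x, W = w); in this example X is independent of W.
p_xw = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.3}
ey_xw = {(0, 0): 100.0, (0, 1): 110.0, (1, 0): 105.0, (1, 1): 115.0}

xs = sorted({x for x, _ in p_xw})
ws = sorted({w for _, w in p_xw})
p_x = {x: sum(p_xw[(x, w)] for w in ws) for x in xs}
p_w = {w: sum(p_xw[(x, w)] for x in xs) for w in ws}
ey_x = {x: sum(ey_xw[(x, w)] * p_xw[(x, w)] for w in ws) / p_x[x] for x in xs}

def condition3_holds(tol=1e-12):
    for x in xs:
        # Clause (a): P(X = x | W = w) = P(X = x) for every w.
        a = all(abs(p_xw[(x, w)] / p_w[w] - p_x[x]) < tol for w in ws)
        # Clause (b): E(Y | X = x, W = w) = E(Y | X = x) for every w.
        b = all(abs(ey_xw[(x, w)] - ey_x[x]) < tol for w in ws)
        if not (a or b):
            return False
    return True

print(condition3_holds())  # True here, since X and W are independent
```

If the function returned False for some W = f(U), unconfoundedness of E(Y | X) would be falsified for that population.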
Since a hypothesis that the conditional expected values E(Y | X = x) are causally unbiased is not falsifiable, we introduced unconfoundedness of E(Y | X) as the weakest empirically falsifiable sufficient condition for causal unbiasedness. How should we proceed if we want to try to falsify the hypothesis of unconfoundedness? A first general principle is to choose a variable W = f(U) and compare the estimates on the right-hand side of Equation (9)

to the estimates on the right-hand side of Equation (11), i.e., to compare, for each value x of X,

E_adj for W(Y | X = x) := Σ_w E(Y | X = x, W = w) P(W = w)   (16)

to

E(Y | X = x) = Σ_w E(Y | X = x, W = w) P(W = w | X = x).   (17)

Note that the last equation is always true if W is discrete. If, in a specific empirical application, the estimate of E_adj for W(Y | X = x) cannot be assumed to be an estimate of E(Y | X = x), then unconfoundedness cannot be assumed to hold either.

According to the considerations above, falsifying the hypothesis of unconfoundedness of E(Y | X) in a specific application can be achieved by falsifying the null hypothesis of unconfoundedness of E(Y | X) with respect to a specific W = f(U), i.e., by falsifying the hypothesis

E_adj for W(Y | X = x) − E(Y | X = x) = 0, for each value x of X.   (18)

This hypothesis is equivalent to Equation (9). It may also equivalently be written

Σ_w E(Y | X = x, W = w) [P(W = w) − P(W = w | X = x)] = 0, for each value x of X.   (19)

If the unconditional probabilities P(W = w) and the conditional probabilities P(W = w | X = x) are known, Equation (19) is a linear hypothesis about the parameters E(Y | X = x, W = w).

A second general procedure can be based on condition 3 in Table 4. Again, choose a variable W = f(U) and check whether, for each given value x of X, condition

P(X = x | W = w) = P(X = x) for each value w of W   (20)

holds, or

E(Y | X = x, W = w) = E(Y | X = x) for each value w of W.   (21)

If there is a value x of X for which neither of these conditions holds, then unconfoundedness does not hold either.

8. Another Example

The example displayed in Table 5 serves to illustrate the general procedures to falsify unconfoundedness outlined in the last section. Of course, a simple look at treatment
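The comparison of Equations (16) and (17) can be carried out directly on sample data: estimate the cell means E(Y | X = x, W = w) and reweight them once by the marginal relative frequencies of W [Eq. (16)] and once by the frequencies of W within X = x [Eq. (17)]. The sketch below does this with numpy on an invented data-generating process; names such as `adjusted_mean` are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Invented data-generating process in which W confounds X and Y.
w = rng.integers(0, 2, size=n)                       # W = f(U), e.g. gender
p_x1 = np.where(w == 1, 0.7, 0.3)                    # assignment depends on W
x = (rng.random(n) < p_x1).astype(int)
y = 100 + 10 * w + 5 * x + rng.normal(0, 1, size=n)  # Y depends on W and X

def raw_mean(xv):
    """Estimate of E(Y | X = x), the right-hand side of Eq. (17)."""
    return y[x == xv].mean()

def adjusted_mean(xv):
    """Estimate of E_adj for W(Y | X = x), Eq. (16): cell means weighted by P(W = w)."""
    return sum(y[(x == xv) & (w == wv)].mean() * np.mean(w == wv) for wv in (0, 1))

for xv in (0, 1):
    print(xv, round(raw_mean(xv), 2), round(adjusted_mean(xv), 2))
# The two estimates differ markedly for each x, speaking against
# unconfoundedness of E(Y | X) with respect to W [Eq. (18)].
```

In a real application the difference in Equation (18) would of course be judged by a statistical test rather than by inspection.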

condition x₂ reveals that the requirements formulated in Definition 1 do not hold in this example. For x₂ we have neither constant assignment probabilities across units nor constant individual conditional expected values E(Y | X = x₂, U = u) across units. As a consequence, the conditional expected value E(Y | X = x₂) is causally biased: whereas the causally unbiased expected value is CUE(Y | X = x₂) = 105, the conditional expected value E(Y | X = x₂) is equal to 115. Hence, the prima facie effect PFE(1, 2) = E(Y | X = x₁) − E(Y | X = x₂) = 115 − 115 = 0 is considerably different from the average causal effect ACE(1, 2) = 115 − 105 = 10. Whereas the average causal effect is positive, the prima facie effect is zero.

In the last paragraph we used Definition 1 to check unconfoundedness. However, in empirical applications, data on the individual unit level are neither available nor estimable. Hence, Definition 1 cannot be used in empirical falsification trials. What is estimable, however, are the data within subpopulations such as gender (see column 3 of Table 5). The conditional expected values E(Y | X = x, W = w) may be computed by the formula

E(Y | X = x, W = w) = Σ_u E(Y | X = x, U = u, W = w) P(U = u | X = x, W = w),

where E(Y | X = x, W = w, U = u) = E(Y | X = x, U = u), because W = f(U). Hence, for this purpose we have to compute the conditional probabilities P(U = u | X = x, W = w). Using the data of Table 5 we obtain, for X = x₂ and W = m,

E(Y | X = x₂, W = m) = 88.5,

and, for X = x₂ and W = f,

E(Y | X = x₂, W = f) ≈ 126.36

[the latter value also follows from 115 = (3/10) · 88.5 + (7/10) · E(Y | X = x₂, W = f)]. The unconditional probabilities P(W = w) are P(W = f) = P(W = m) = 1/2, whereas the conditional probabilities P(W = w | X = x₂) are P(W = m | X = x₂) = 3/10 and P(W = f | X = x₂) = 7/10. Inserting these results into the left-hand side of Equation (19) yields

Σ_w E(Y | X = x₂, W = w) [P(W = w) − P(W = w | X = x₂)]
= 88.5 (1/2 − 3/10) + 126.36 (1/2 − 7/10) ≈ 17.70 − 25.27 = −7.57 ≠ 0.
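The arithmetic of this example can be checked in a few lines; here the female cell mean is recovered from the decomposition E(Y | X = x₂) = Σ_w E(Y | X = x₂, W = w) P(W = w | X = x₂) rather than read from Table 5.

```python
# Values stated in the example of Table 5.
e_x2 = 115.0                 # E(Y | X = x2)
e_x2_m = 88.5                # E(Y | X = x2, W = m)
p_m_x2, p_f_x2 = 0.3, 0.7    # P(W = w | X = x2)
p_m, p_f = 0.5, 0.5          # P(W = w)

# Recover E(Y | X = x2, W = f) from the law of total expectation.
e_x2_f = (e_x2 - e_x2_m * p_m_x2) / p_f_x2
print(round(e_x2_f, 2))  # 126.36

# Left-hand side of Eq. (19) for x = x2: nonzero, so E(Y | X) is
# confounded with respect to W in this population.
lhs = e_x2_m * (p_m - p_m_x2) + e_x2_f * (p_f - p_f_x2)
print(round(lhs, 2))  # -7.57
```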

Hence, this is a contradiction to the hypothesis that the regression E(Y | X) is unconfounded in the example of Table 5 [see Eq. (19)]. The second general procedure outlined in Section 7 is easier to follow in this example: it is easily seen from Table 5 that neither Equation (20) nor Equation (21) holds for X = x₂ and W = m, for instance. It should be emphasized again that all parameters necessary for the general procedures to falsify unconfoundedness illustrated in this example are estimable in real empirical applications. In this sense, unconfoundedness is empirically falsifiable.

Table 5. An example in which the treatment regression E(Y | X) is confounded

Unit u | P(U = u) | W (gender) | P(X = x₁|U = u) | E(Y|X = x₁, U = u) | P(X = x₂|U = u) | E(Y|X = x₂, U = u) | P(X = x₃|U = u) | E(Y|X = x₃, U = u)
u₁ | 1/8 | m | 1/2 | 82 | … | … | … | …
u₂ | 1/8 | m | 1/2 | 89 | … | … | … | …
u₃ | 1/8 | m | … | … | … | … | … | …
u₄ | 1/8 | m | … | … | … | … | … | …
u₅ | 1/8 | f | … | … | … | … | … | …
u₆ | 1/8 | f | … | … | … | … | … | …
u₇ | 1/8 | f | … | … | … | … | … | …
u₈ | 1/8 | f | … | … | … | … | … | …

Note: The (unconditional) probabilities for the three treatments are P(X = x₁) = 1/2, P(X = x₂) = P(X = x₃) = 1/4.

9. Special Procedures to Falsify Unconfoundedness

In the last two sections, we did not presuppose anything about the functional form of the regression E(Y | X, W). In empirical applications, the procedures outlined in these sections can be used if the sample is big enough to estimate each conditional expected value E(Y | X = x, W = w) sufficiently well and if the probabilities P(X = x | W = w) are known or can also be estimated sufficiently well. If, however,

compared to the sample size, the number of different combinations (x, w) of values of X and W is too large, then there might even be pairs (x, w) for which there are no observations at all. If, however, we are able to make an assumption about the functional form of the regression E(Y | X, W), sparseness of data is less of a problem. We will consider two such assumptions: linearity [Eq. (22)] and conditional linearity [Eq. (32)]. The case of linearity has already been considered by Pratt and Schlaifer (1988), for instance. In the next theorem, X = (X₁, …, X_p)′ denotes a p-dimensional regressor and W = (W₁, …, W_q)′ a q-dimensional function of U. Both X and W are column vectors.¹⁵

Theorem 7. If (a) W = f(U), (b) the treatment regression E(Y | X) is unconfounded (or at least unconfounded with respect to W), (c) the covariance matrix Σ of the (p + q)-dimensional row vector (X′, W′) is regular, and (d)

E(Y | X, W) = β₀ + X′β_X + W′β_W,  β₀ ∈ ℝ, β_X ∈ ℝᵖ, β_W ∈ ℝ^q,   (22)

then

E(Y | X) = α₀ + X′α_X,  α₀ ∈ ℝ, α_X ∈ ℝᵖ,   (23)

α_X = β_X,   (24)

E(W′ | X) β_W = E(W′) β_W,   (25)

Σ_XW β_W = 0.   (26)

In Equation (26), Σ_XW denotes the covariance matrix of X and W. Hence, under the linearity assumption [Eq. (22)], unconfoundedness of E(Y | X) can be falsified in a specific empirical application if it is shown that α_X = β_X does not hold. Alternatively, one might try to show that Σ_XW β_W = 0 does not hold. Hence, tests of α_X = β_X and tests of Σ_XW β_W = 0 are tests of unconfoundedness, provided that W = f(U) and Equation (22) is true. Usually, a test of linearity or goodness of fit of Equation (22) should precede a test of unconfoundedness. Also note that, given the presuppositions of Theorem 7 [including Eq. (22)], Equations (24) and (26) are equivalent.
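Theorem 7 can be illustrated by simulation: when X is randomized (hence unconfounded) and Equation (22) holds, regressing Y on X alone recovers the same slope as regressing Y on X and W jointly. The data-generating values below are invented; with a confounded assignment the two slopes would separate.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
w = rng.normal(0, 1, n)            # W = f(U)
x = rng.normal(0, 1, n)            # randomized: X independent of W (and U)
y = 2.0 + 3.0 * x + 4.0 * w + rng.normal(0, 1, n)  # Eq. (22) with beta_X = 3

def ols(design, target):
    """Least-squares coefficients with an intercept column prepended."""
    a = np.column_stack([np.ones(len(target))] + list(design))
    return np.linalg.lstsq(a, target, rcond=None)[0]

alpha = ols([x], y)      # E(Y | X) = alpha_0 + alpha_X X          [Eq. (23)]
beta = ols([x, w], y)    # E(Y | X, W) = beta_0 + beta_X X + beta_W W  [Eq. (22)]
print(round(alpha[1], 2), round(beta[1], 2))  # alpha_X and beta_X nearly coincide [Eq. (24)]
```

Replacing the second line of the data-generating process with, say, `x = 0.5 * w + rng.normal(0, 1, n)` would make Σ_XW β_W ≠ 0 and drive α_X away from β_X, in line with Equations (24) and (26).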
Observe that the vectors of regression coefficients β_X and β_W can be computed by

β_X = (Σ_XX⁻¹ + Γ E⁻¹ Γ′) Σ_XY − Γ E⁻¹ Σ_WY,   (27)

¹⁵ Note that all concepts treated in this paper are not restricted to a univariate treatment variable X. If X is p-variate, a value x of X consists of p values x₁, …, x_p. Since we are now formulating the models in terms of matrix algebra, we will denote the treatment variable by the boldface letter X. The same holds true, of course, for W.
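Equations (27) and (28) are just the block form of the joint least-squares solution. As a sanity check, the sketch below (invented covariance values, p = 2, q = 1) computes β_X and β_W from the blocks and compares them with directly solving the joint normal equations Σ (β_X′, β_W′)′ = (Σ_XY′, Σ_WY′)′.

```python
import numpy as np

# Invented covariance blocks for (X1, X2, W, Y); the joint 3x3 matrix must be regular.
s_xx = np.array([[2.0, 0.3], [0.3, 1.5]])   # Sigma_XX
s_xw = np.array([[0.4], [0.2]])             # Sigma_XW
s_ww = np.array([[1.0]])                    # Sigma_WW
s_xy = np.array([1.1, 0.7])                 # Sigma_XY
s_wy = np.array([0.9])                      # Sigma_WY

gamma = np.linalg.solve(s_xx, s_xw)              # Gamma = Sigma_XX^{-1} Sigma_XW
e = s_ww - s_xw.T @ np.linalg.solve(s_xx, s_xw)  # E = Sigma_WW - Sigma_WX Sigma_XX^{-1} Sigma_XW
e_inv = np.linalg.inv(e)

beta_x = (np.linalg.inv(s_xx) + gamma @ e_inv @ gamma.T) @ s_xy - gamma @ e_inv @ s_wy  # Eq. (27)
beta_w = e_inv @ (s_wy - gamma.T @ s_xy)                                                # Eq. (28)

# Direct solution of the joint normal equations for comparison.
sigma = np.block([[s_xx, s_xw], [s_xw.T, s_ww]])
beta = np.linalg.solve(sigma, np.concatenate([s_xy, s_wy]))
assert np.allclose(np.concatenate([beta_x, beta_w]), beta)
print("block formulas agree with the joint solution")
```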

and

β_W = E⁻¹ (Σ_WY − Γ′ Σ_XY),   (28)

where E := Σ_WW − Σ_WX Σ_XX⁻¹ Σ_XW and Γ := Σ_XX⁻¹ Σ_XW (see, e.g., Seber, 1984). Also note that, for p = q = 1, Equations (27) and (28) simplify to the well-known formulas for partial regression coefficients:

β_X = (σ²_W σ_XY − σ_WX σ_WY) / (σ²_X σ²_W − σ²_WX)   (29)

and

β_W = (σ²_X σ_WY − σ_WX σ_XY) / (σ²_X σ²_W − σ²_WX).   (30)

In Theorem 7 we presuppose that the regression E(Y | X, W) is additive. In particular, Equation (22) does not allow for multiplicative terms such as X·W, which occur, e.g., in the following equation:

E(Y | X, W) = β₀ + β₁X + β₂W + β₃X·W
            = (β₀ + β₂W) + (β₁ + β₃W) X
            = g₀(W) + g₁(W) X.   (31)

Note that E(Y | X, W) can always be parameterized in the form of Equation (31) if both X and W are dichotomous. In other cases, however, Equation (31) may not hold for the regression E(Y | X, W). The values of the slope coefficient function g₁(W) are the slope coefficients of the (W = w)-conditional linear regressions E_{W=w}(Y | X) = g₀(w) + g₁(w) X. If, e.g., W is gender, E_{W=m}(Y | X) may represent the regression of Y on X in the subpopulation of males (m).

In the next theorem, we consider a special case which allows for a p-dimensional regressor X = (X₁, …, X_p)′. The basic idea behind Equation (32) is that, for each given value w of W, the regression of Y on X is an additive linear multiple regression with the p-variate numerical regressor X = (X₁, …, X_p)′.

Theorem 8. If (a) the treatment regression E(Y | X) is unconfounded, (b) X: Ω → ℝᵖ, (c) W = f(U), and (d) there are functions g₀, …, g_p such that

E(Y | X, W) = g₀(W) + g₁(W) X₁ + … + g_p(W) X_p,   (32)

then

E(Y | X) = α₀ + α₁X₁ + … + α_p X_p   (33)

with

α₀ = E[g₀(W)], α₁ = E[g₁(W)], …, α_p = E[g_p(W)].   (34)

According to this theorem, if unconfoundedness holds, the partial regression coefficients of the regressors X₁, …, X_p are the expected values of the corresponding (W = w)-conditional partial regression coefficients. Furthermore, this result may also be used for falsification trials of the hypothesis that E(Y | X) is unconfounded: if Equations (33) and (34) do not hold although Equation (32) does, then we can reject the hypothesis that E(Y | X) is unconfounded. For p = 1, this theorem shows how the slope coefficient α₁ of a simple linear regression E(Y | X) = α₀ + α₁X is related to the slope coefficient function g₁(W) if unconfoundedness holds. A special case of Equation (32) results if X₁ := X¹, …, X_p := Xᵖ, i.e., if the regression of Y on X is a polynomial function of the one-dimensional numerical regressor X.

10. Discussion

In this paper we introduced the concept of unconfoundedness of a treatment regression E(Y | X) and showed its relationship to Neyman's and Rubin's concepts of individual and average causal effects. We discussed several criteria that are relevant in causal modeling. The weakest is causal unbiasedness of the conditional expected values E(Y | X = x) [see Eq. (1)], which has the disadvantage of being nonfalsifiable. Other conditions were independence of U and X, unit-treatment homogeneity [E(Y | X, U) = E(Y | X)], and the and/or-combination of the two. All these conditions are sufficient conditions for causal unbiasedness, but they are unnecessarily strong. We finally introduced unconfoundedness as the weakest empirically falsifiable sufficient condition for causal unbiasedness. We defined a regression E(Y | X) to be unconfounded if, for each value x of X:

(a) P(X = x | U = u) = P(X = x) for each value u of U, or

(b) E(Y | X = x, U = u) = E(Y | X = x) for each value u of U.
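Theorem 8 can be illustrated on a small exact population, constructed here for illustration: with X randomized (so X is independent of W), the slope of E(Y | X) equals the P(W = w)-weighted average of the conditional slopes g₁(w).

```python
# Invented conditionally linear population: E(Y | X, W) = g0(W) + g1(W) * X  [Eq. (32), p = 1]
g0 = {0: 100.0, 1: 110.0}
g1 = {0: 2.0, 1: 6.0}
p_w = {0: 0.4, 1: 0.6}
p_x = {0: 0.5, 1: 0.3, 2: 0.2}   # randomized: X independent of W

# E(Y | X = x) = sum_w E(Y | X = x, W = w) P(W = w | X = x); independence
# makes P(W = w | X = x) = P(W = w).
e_y_x = {x: sum((g0[w] + g1[w] * x) * p_w[w] for w in p_w) for x in p_x}

# Under Eqs. (33)/(34) this function of x is linear with slope E[g1(W)].
alpha1 = e_y_x[1] - e_y_x[0]
expected = sum(g1[w] * p_w[w] for w in p_w)   # E[g1(W)] = 0.4*2 + 0.6*6 = 4.4
assert abs(alpha1 - expected) < 1e-9
assert abs((e_y_x[2] - e_y_x[1]) - alpha1) < 1e-9   # E(Y | X) is linear in x
print(round(alpha1, 2))  # 4.4
```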
This concept goes beyond the observational level and still has many implications, some of which are empirically testable. Unconfoundedness implies that the conditional expected values E(Y | X = x) and their differences E(Y | X = x_i) − E(Y | X = x_j), the prima facie effects, are causally unbiased, i.e., E(Y | X = x_i) − E(Y | X = x_j) = ACE(i, j). This provides the link to the concept of the average causal effect. Furthermore, we showed that unconfoundedness of E(Y | X) is sufficient to imply that the conditional expected values E(Y | X = x, W = w) and their differences E(Y | X = x_i, W = w) − E(Y | X = x_j, W = w) are causally unbiased in every subpopulation W = w (see Corollary 2).

The sufficient conditions for unconfoundedness of E(Y | X) may guide us in the design of experiments. Most importantly, however, they make clear that random assignment of units to experimental conditions serves to secure stochastic independence of X and U, which implies unconfoundedness and causal unbiasedness. Such an experiment with random assignment of observational units to experimental conditions is in fact the only way to deliberately create the circumstances in which average causal effects can be estimated. Although unit-treatment homogeneity [i.e., E(Y | X, U) = E(Y | X)] is also sufficient for unconfoundedness (see Th. 3), this condition cannot deliberately be created through techniques of experimental design. Nevertheless, both sufficient conditions for unconfoundedness, independence of X and U and unit-treatment homogeneity, show that causal modeling in randomized and nonrandomized experiments aims at the same goal: to secure causal unbiasedness. Hence, if it were possible to generalize the theory presented such that it could also cover causal modeling in observational (i.e., nonexperimental) studies, in which the X-variables do not represent experimental conditions, the two traditions of theorizing about causality in statistics identified by Cox (1992), the experimental and the observational, could be integrated into a single theory.

The necessary conditions for unconfoundedness of E(Y | X) may guide us in nonrandomized experimental causal modeling, because they provide empirically testable consequences. The theorems presented show which statistical hypotheses have to be tested and rejected in order to falsify the hypothesis of unconfoundedness of E(Y | X) in a specific application. What happens if a W has been identified with respect to which there is confounding? First, we may look at the regressive dependencies of Y on X within each subpopulation defined by W = w, i.e., we may look at the conditional effects.
In fact, this provides more detailed information about the conditional regressive dependence of Y on X given W = w. Note that this procedure is in full accordance with the framework presented in this paper; one would just replace the total population Ω_U by the subpopulation defined by W = w. This procedure is, in principle, equivalent to statistically controlling for the covariate vector W, provided that the equation for the regression E(Y | X, W) is adequately specified. Second, if we are willing to assume that the conditional regressions E_{W=w}(Y | X) are unbiased, we may also compute the average causal effects via Equation (16), because under this assumption ACE(i, j) = E_adj for W(Y | X = x_i) − E_adj for W(Y | X = x_j) (for details, see Wüthrich-Martone, Steyer, Nachtigall & Suhl, 1999).

A limitation requiring a more general formulation of our approach follows from assuming P(X = x, U = u) > 0. Whereas this assumption has been useful for a simple presentation, it restricts generality because it is not compatible with the assumption

that X has a normal distribution, for instance, which is often made in nonexperimental causal modeling (see, e.g., Bollen, 1989). Since normality is not really necessary in structural equation models (see, e.g., Browne & Arminger, 1995), this restriction is not too serious. Nevertheless, we should look for a more general theory which is not restricted to discrete treatments and units.

Another open problem is how the theory presented relates to statistical sampling models. In particular, statistical tests and procedures have to be developed to help us decide whether or not unconfoundedness holds in a specific application. For some cases, these tests and procedures have already been presented (see, e.g., Allison, 1995; Clogg et al., 1992; Clogg et al., 1995; Steyer, Gabler, Rucai & Schuster, 1997; von Davier, 2000); for others they still need to be developed. A significance test of unconfoundedness in very large samples may not really be meaningful in many empirical applications, because the exact null hypothesis will rarely hold in nonrandomized experiments. Instead, one might rather be interested in estimating a coefficient reflecting the strength of confounding of a regression E(Y | X) with respect to a specific variable W. Such a coefficient has been proposed by Steyer, Gabler, and Rucai (1996). More research and experience will be necessary to work out principles dealing with the practical relevance of confounding in concrete empirical studies.

References

[1] Allison, P. D. (1995). The impact of random predictors on comparisons of coefficients between models: Comment on Clogg, Petkova, and Haritou. American Journal of Sociology, 100,

[2] Bollen, K. A. (1989). Structural equation models with latent variables. New York: Wiley.

[3] Browne, M. W. & Arminger, G. (1995). Specification and estimation of mean- and covariance-structure models. In G. Arminger, C. C. Clogg & M. E.
Sobel (eds.), Handbook of Statistical Modeling for the Social and Behavioral Sciences (pp. ). New York: Plenum.

[4] Clogg, C. C., Petkova, E., & Shihadeh, E. S. (1992). Statistical methods for analyzing collapsibility in regression models. Journal of Educational Statistics, 17,

[5] Clogg, C. C., Petkova, E., & Haritou, A. (1995). Statistical methods for comparing regression coefficients between models. American Journal of Sociology, 100,


More information

Answers to Problem Set #4

Answers to Problem Set #4 Answers to Problem Set #4 Problems. Suppose that, from a sample of 63 observations, the least squares estimates and the corresponding estimated variance covariance matrix are given by: bβ bβ 2 bβ 3 = 2

More information

Dealing with the assumption of independence between samples - introducing the paired design.

Dealing with the assumption of independence between samples - introducing the paired design. Dealing with the assumption of independence between samples - introducing the paired design. a) Suppose you deliberately collect one sample and measure something. Then you collect another sample in such

More information

Key Algebraic Results in Linear Regression

Key Algebraic Results in Linear Regression Key Algebraic Results in Linear Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) 1 / 30 Key Algebraic Results in

More information

An Introduction to Path Analysis

An Introduction to Path Analysis An Introduction to Path Analysis PRE 905: Multivariate Analysis Lecture 10: April 15, 2014 PRE 905: Lecture 10 Path Analysis Today s Lecture Path analysis starting with multivariate regression then arriving

More information

Approximate analysis of covariance in trials in rare diseases, in particular rare cancers

Approximate analysis of covariance in trials in rare diseases, in particular rare cancers Approximate analysis of covariance in trials in rare diseases, in particular rare cancers Stephen Senn (c) Stephen Senn 1 Acknowledgements This work is partly supported by the European Union s 7th Framework

More information

Logistic Regression: Regression with a Binary Dependent Variable

Logistic Regression: Regression with a Binary Dependent Variable Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression

More information

1 Correlation between an independent variable and the error

1 Correlation between an independent variable and the error Chapter 7 outline, Econometrics Instrumental variables and model estimation 1 Correlation between an independent variable and the error Recall that one of the assumptions that we make when proving the

More information

Introduction to Machine Learning Fall 2017 Note 5. 1 Overview. 2 Metric

Introduction to Machine Learning Fall 2017 Note 5. 1 Overview. 2 Metric CS 189 Introduction to Machine Learning Fall 2017 Note 5 1 Overview Recall from our previous note that for a fixed input x, our measurement Y is a noisy measurement of the true underlying response f x):

More information

A Study of Statistical Power and Type I Errors in Testing a Factor Analytic. Model for Group Differences in Regression Intercepts

A Study of Statistical Power and Type I Errors in Testing a Factor Analytic. Model for Group Differences in Regression Intercepts A Study of Statistical Power and Type I Errors in Testing a Factor Analytic Model for Group Differences in Regression Intercepts by Margarita Olivera Aguilar A Thesis Presented in Partial Fulfillment of

More information

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 Introduction to Generalized Univariate Models: Models for Binary Outcomes EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 EPSY 905: Intro to Generalized In This Lecture A short review

More information

Analysing data: regression and correlation S6 and S7

Analysing data: regression and correlation S6 and S7 Basic medical statistics for clinical and experimental research Analysing data: regression and correlation S6 and S7 K. Jozwiak k.jozwiak@nki.nl 2 / 49 Correlation So far we have looked at the association

More information

An Introduction to Mplus and Path Analysis

An Introduction to Mplus and Path Analysis An Introduction to Mplus and Path Analysis PSYC 943: Fundamentals of Multivariate Modeling Lecture 10: October 30, 2013 PSYC 943: Lecture 10 Today s Lecture Path analysis starting with multivariate regression

More information

Lecture 6: Univariate Volatility Modelling: ARCH and GARCH Models

Lecture 6: Univariate Volatility Modelling: ARCH and GARCH Models Lecture 6: Univariate Volatility Modelling: ARCH and GARCH Models Prof. Massimo Guidolin 019 Financial Econometrics Winter/Spring 018 Overview ARCH models and their limitations Generalized ARCH models

More information

Hypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n =

Hypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n = Hypothesis testing I I. What is hypothesis testing? [Note we re temporarily bouncing around in the book a lot! Things will settle down again in a week or so] - Exactly what it says. We develop a hypothesis,

More information

Causality II: How does causal inference fit into public health and what it is the role of statistics?

Causality II: How does causal inference fit into public health and what it is the role of statistics? Causality II: How does causal inference fit into public health and what it is the role of statistics? Statistics for Psychosocial Research II November 13, 2006 1 Outline Potential Outcomes / Counterfactual

More information

Popper s Measure of Corroboration and P h b

Popper s Measure of Corroboration and P h b Popper s Measure of Corroboration and P h b Darrell P. Rowbottom This paper shows that Popper s measure of corroboration is inapplicable if, as Popper also argued, the logical probability of synthetic

More information

Test Yourself! Methodological and Statistical Requirements for M.Sc. Early Childhood Research

Test Yourself! Methodological and Statistical Requirements for M.Sc. Early Childhood Research Test Yourself! Methodological and Statistical Requirements for M.Sc. Early Childhood Research HOW IT WORKS For the M.Sc. Early Childhood Research, sufficient knowledge in methods and statistics is one

More information

The decision theoretic approach to causal inference OR Rethinking the paradigms of causal modelling

The decision theoretic approach to causal inference OR Rethinking the paradigms of causal modelling The decision theoretic approach to causal inference OR Rethinking the paradigms of causal modelling A.P.Dawid 1 and S.Geneletti 2 1 University of Cambridge, Statistical Laboratory 2 Imperial College Department

More information

REVIEW 8/2/2017 陈芳华东师大英语系

REVIEW 8/2/2017 陈芳华东师大英语系 REVIEW Hypothesis testing starts with a null hypothesis and a null distribution. We compare what we have to the null distribution, if the result is too extreme to belong to the null distribution (p

More information

A Rejoinder to Mackintosh and some Remarks on the. Concept of General Intelligence

A Rejoinder to Mackintosh and some Remarks on the. Concept of General Intelligence A Rejoinder to Mackintosh and some Remarks on the Concept of General Intelligence Moritz Heene Department of Psychology, Ludwig Maximilian University, Munich, Germany. 1 Abstract In 2000 Nicholas J. Mackintosh

More information

Assess Assumptions and Sensitivity Analysis. Fan Li March 26, 2014

Assess Assumptions and Sensitivity Analysis. Fan Li March 26, 2014 Assess Assumptions and Sensitivity Analysis Fan Li March 26, 2014 Two Key Assumptions 1. Overlap: 0

More information

A Note on Bootstraps and Robustness. Tony Lancaster, Brown University, December 2003.

A Note on Bootstraps and Robustness. Tony Lancaster, Brown University, December 2003. A Note on Bootstraps and Robustness Tony Lancaster, Brown University, December 2003. In this note we consider several versions of the bootstrap and argue that it is helpful in explaining and thinking about

More information

OUTLINE THE MATHEMATICS OF CAUSAL INFERENCE IN STATISTICS. Judea Pearl University of California Los Angeles (www.cs.ucla.

OUTLINE THE MATHEMATICS OF CAUSAL INFERENCE IN STATISTICS. Judea Pearl University of California Los Angeles (www.cs.ucla. THE MATHEMATICS OF CAUSAL INFERENCE IN STATISTICS Judea Pearl University of California Los Angeles (www.cs.ucla.edu/~judea) OUTLINE Modeling: Statistical vs. Causal Causal Models and Identifiability to

More information

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI Introduction of Data Analytics Prof. Nandan Sudarsanam and Prof. B Ravindran Department of Management Studies and Department of Computer Science and Engineering Indian Institute of Technology, Madras Module

More information

Introduction to Econometrics. Review of Probability & Statistics

Introduction to Econometrics. Review of Probability & Statistics 1 Introduction to Econometrics Review of Probability & Statistics Peerapat Wongchaiwat, Ph.D. wongchaiwat@hotmail.com Introduction 2 What is Econometrics? Econometrics consists of the application of mathematical

More information

Imbens/Wooldridge, IRP Lecture Notes 2, August 08 1

Imbens/Wooldridge, IRP Lecture Notes 2, August 08 1 Imbens/Wooldridge, IRP Lecture Notes 2, August 08 IRP Lectures Madison, WI, August 2008 Lecture 2, Monday, Aug 4th, 0.00-.00am Estimation of Average Treatment Effects Under Unconfoundedness, Part II. Introduction

More information

Estimating the Marginal Odds Ratio in Observational Studies

Estimating the Marginal Odds Ratio in Observational Studies Estimating the Marginal Odds Ratio in Observational Studies Travis Loux Christiana Drake Department of Statistics University of California, Davis June 20, 2011 Outline The Counterfactual Model Odds Ratios

More information

Contingency Tables. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels.

Contingency Tables. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels. Contingency Tables Definition & Examples. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels. (Using more than two factors gets complicated,

More information

Relative Improvement by Alternative Solutions for Classes of Simple Shortest Path Problems with Uncertain Data

Relative Improvement by Alternative Solutions for Classes of Simple Shortest Path Problems with Uncertain Data Relative Improvement by Alternative Solutions for Classes of Simple Shortest Path Problems with Uncertain Data Part II: Strings of Pearls G n,r with Biased Perturbations Jörg Sameith Graduiertenkolleg

More information

Three-Level Modeling for Factorial Experiments With Experimentally Induced Clustering

Three-Level Modeling for Factorial Experiments With Experimentally Induced Clustering Three-Level Modeling for Factorial Experiments With Experimentally Induced Clustering John J. Dziak The Pennsylvania State University Inbal Nahum-Shani The University of Michigan Copyright 016, Penn State.

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)

More information

1 Least Squares Estimation - multiple regression.

1 Least Squares Estimation - multiple regression. Introduction to multiple regression. Fall 2010 1 Least Squares Estimation - multiple regression. Let y = {y 1,, y n } be a n 1 vector of dependent variable observations. Let β = {β 0, β 1 } be the 2 1

More information

Bayesian Inference for Sequential Treatments under Latent Sequential Ignorability. June 19, 2017

Bayesian Inference for Sequential Treatments under Latent Sequential Ignorability. June 19, 2017 Bayesian Inference for Sequential Treatments under Latent Sequential Ignorability Alessandra Mattei, Federico Ricciardi and Fabrizia Mealli Department of Statistics, Computer Science, Applications, University

More information

ECON3150/4150 Spring 2015

ECON3150/4150 Spring 2015 ECON3150/4150 Spring 2015 Lecture 3&4 - The linear regression model Siv-Elisabeth Skjelbred University of Oslo January 29, 2015 1 / 67 Chapter 4 in S&W Section 17.1 in S&W (extended OLS assumptions) 2

More information

Summary and discussion of The central role of the propensity score in observational studies for causal effects

Summary and discussion of The central role of the propensity score in observational studies for causal effects Summary and discussion of The central role of the propensity score in observational studies for causal effects Statistics Journal Club, 36-825 Jessica Chemali and Michael Vespe 1 Summary 1.1 Background

More information

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data Ronald Heck Class Notes: Week 8 1 Class Notes: Week 8 Probit versus Logit Link Functions and Count Data This week we ll take up a couple of issues. The first is working with a probit link function. While

More information

Linear Model Under General Variance

Linear Model Under General Variance Linear Model Under General Variance We have a sample of T random variables y 1, y 2,, y T, satisfying the linear model Y = X β + e, where Y = (y 1,, y T )' is a (T 1) vector of random variables, X = (T

More information

Formal Statement of Simple Linear Regression Model

Formal Statement of Simple Linear Regression Model Formal Statement of Simple Linear Regression Model Y i = β 0 + β 1 X i + ɛ i Y i value of the response variable in the i th trial β 0 and β 1 are parameters X i is a known constant, the value of the predictor

More information

Distribution-Free Procedures (Devore Chapter Fifteen)

Distribution-Free Procedures (Devore Chapter Fifteen) Distribution-Free Procedures (Devore Chapter Fifteen) MATH-5-01: Probability and Statistics II Spring 018 Contents 1 Nonparametric Hypothesis Tests 1 1.1 The Wilcoxon Rank Sum Test........... 1 1. Normal

More information

Causal Inference Lecture Notes: Causal Inference with Repeated Measures in Observational Studies

Causal Inference Lecture Notes: Causal Inference with Repeated Measures in Observational Studies Causal Inference Lecture Notes: Causal Inference with Repeated Measures in Observational Studies Kosuke Imai Department of Politics Princeton University November 13, 2013 So far, we have essentially assumed

More information

WHEN ARE PROBABILISTIC EXPLANATIONS

WHEN ARE PROBABILISTIC EXPLANATIONS PATRICK SUPPES AND MARIO ZANOTTI WHEN ARE PROBABILISTIC EXPLANATIONS The primary criterion of ade uacy of a probabilistic causal analysis is causal that the variable should render simultaneous the phenomenological

More information

OUTLINE CAUSAL INFERENCE: LOGICAL FOUNDATION AND NEW RESULTS. Judea Pearl University of California Los Angeles (www.cs.ucla.

OUTLINE CAUSAL INFERENCE: LOGICAL FOUNDATION AND NEW RESULTS. Judea Pearl University of California Los Angeles (www.cs.ucla. OUTLINE CAUSAL INFERENCE: LOGICAL FOUNDATION AND NEW RESULTS Judea Pearl University of California Los Angeles (www.cs.ucla.edu/~judea/) Statistical vs. Causal vs. Counterfactual inference: syntax and semantics

More information

Lecture 10: F -Tests, ANOVA and R 2

Lecture 10: F -Tests, ANOVA and R 2 Lecture 10: F -Tests, ANOVA and R 2 1 ANOVA We saw that we could test the null hypothesis that β 1 0 using the statistic ( β 1 0)/ŝe. (Although I also mentioned that confidence intervals are generally

More information

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1 Regression diagnostics As is true of all statistical methodologies, linear regression analysis can be a very effective way to model data, as along as the assumptions being made are true. For the regression

More information

Research Note: A more powerful test statistic for reasoning about interference between units

Research Note: A more powerful test statistic for reasoning about interference between units Research Note: A more powerful test statistic for reasoning about interference between units Jake Bowers Mark Fredrickson Peter M. Aronow August 26, 2015 Abstract Bowers, Fredrickson and Panagopoulos (2012)

More information

Wooldridge, Introductory Econometrics, 3d ed. Chapter 9: More on specification and data problems

Wooldridge, Introductory Econometrics, 3d ed. Chapter 9: More on specification and data problems Wooldridge, Introductory Econometrics, 3d ed. Chapter 9: More on specification and data problems Functional form misspecification We may have a model that is correctly specified, in terms of including

More information

Midterm 2 - Solutions

Midterm 2 - Solutions Ecn 102 - Analysis of Economic Data University of California - Davis February 23, 2010 Instructor: John Parman Midterm 2 - Solutions You have until 10:20am to complete this exam. Please remember to put

More information

Bias Variance Trade-off

Bias Variance Trade-off Bias Variance Trade-off The mean squared error of an estimator MSE(ˆθ) = E([ˆθ θ] 2 ) Can be re-expressed MSE(ˆθ) = Var(ˆθ) + (B(ˆθ) 2 ) MSE = VAR + BIAS 2 Proof MSE(ˆθ) = E((ˆθ θ) 2 ) = E(([ˆθ E(ˆθ)]

More information

Significance Testing with Incompletely Randomised Cases Cannot Possibly Work

Significance Testing with Incompletely Randomised Cases Cannot Possibly Work Human Journals Short Communication December 2018 Vol.:11, Issue:2 All rights are reserved by Stephen Gorard FRSA FAcSS Significance Testing with Incompletely Randomised Cases Cannot Possibly Work Keywords:

More information

Review of the General Linear Model

Review of the General Linear Model Review of the General Linear Model EPSY 905: Multivariate Analysis Online Lecture #2 Learning Objectives Types of distributions: Ø Conditional distributions The General Linear Model Ø Regression Ø Analysis

More information

Potential Outcomes and Causal Inference I

Potential Outcomes and Causal Inference I Potential Outcomes and Causal Inference I Jonathan Wand Polisci 350C Stanford University May 3, 2006 Example A: Get-out-the-Vote (GOTV) Question: Is it possible to increase the likelihood of an individuals

More information

Evaluating sensitivity of parameters of interest to measurement invariance using the EPC-interest

Evaluating sensitivity of parameters of interest to measurement invariance using the EPC-interest Evaluating sensitivity of parameters of interest to measurement invariance using the EPC-interest Department of methodology and statistics, Tilburg University WorkingGroupStructuralEquationModeling26-27.02.2015,

More information

Advanced Quantitative Research Methodology, Lecture Notes: Research Designs for Causal Inference 1

Advanced Quantitative Research Methodology, Lecture Notes: Research Designs for Causal Inference 1 Advanced Quantitative Research Methodology, Lecture Notes: Research Designs for Causal Inference 1 Gary King GaryKing.org April 13, 2014 1 c Copyright 2014 Gary King, All Rights Reserved. Gary King ()

More information

Reconciling factor-based and composite-based approaches to structural equation modeling

Reconciling factor-based and composite-based approaches to structural equation modeling Reconciling factor-based and composite-based approaches to structural equation modeling Edward E. Rigdon (erigdon@gsu.edu) Modern Modeling Methods Conference May 20, 2015 Thesis: Arguments for factor-based

More information

Stochastic dominance with imprecise information

Stochastic dominance with imprecise information Stochastic dominance with imprecise information Ignacio Montes, Enrique Miranda, Susana Montes University of Oviedo, Dep. of Statistics and Operations Research. Abstract Stochastic dominance, which is

More information

Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal

Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal Overview In observational and experimental studies, the goal may be to estimate the effect

More information

Alternatives to Difference Scores: Polynomial Regression and Response Surface Methodology. Jeffrey R. Edwards University of North Carolina

Alternatives to Difference Scores: Polynomial Regression and Response Surface Methodology. Jeffrey R. Edwards University of North Carolina Alternatives to Difference Scores: Polynomial Regression and Response Surface Methodology Jeffrey R. Edwards University of North Carolina 1 Outline I. Types of Difference Scores II. Questions Difference

More information

y response variable x 1, x 2,, x k -- a set of explanatory variables

y response variable x 1, x 2,, x k -- a set of explanatory variables 11. Multiple Regression and Correlation y response variable x 1, x 2,, x k -- a set of explanatory variables In this chapter, all variables are assumed to be quantitative. Chapters 12-14 show how to incorporate

More information

Rerandomization to Balance Covariates

Rerandomization to Balance Covariates Rerandomization to Balance Covariates Kari Lock Morgan Department of Statistics Penn State University Joint work with Don Rubin University of Minnesota Biostatistics 4/27/16 The Gold Standard Randomized

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information