Causal Inference with Counterfactuals


Robin Evans
robin.evans@stats.ox.ac.uk
Hilary 2014

1 Introduction

What does it mean to say that a (possibly random) variable X is a cause of the random variable Y? Certainly X does not determine the value of Y, so how can we claim causality through this noise?

Example 1.1. Suppose you have a headache, and decide to take an aspirin. An hour later the headache is gone: is this because you took the aspirin? Note we are not asking whether aspirin cures headaches in some more general sense; we wish to know whether this specific headache went away because of the decision to take aspirin.

The only sensible way to answer this sort of question about a specific event is to compare the outcome which you observed with the counterfactual outcome which you would have observed had you chosen not to take the aspirin. Let X = 1 if the aspirin is taken, and 0 otherwise. Denote by Y = 1 the event that the headache has disappeared after an hour, and Y = 0 otherwise. We might imagine that there is really a second outcome which, contrary to fact, corresponds to what would have happened if you had not taken the aspirin. We can imagine this as a piece of missing data (Neyman, 1923; Rubin, 1974). We suppose therefore that each individual is characterized by a pair of variables (Y(0), Y(1)), with four possible values. There is an individual-level causal effect if Y(0) ≠ Y(1); the possible pairs are characterized in this table.

    Y(0)  Y(1)  Type t_Y
     0     0    Never Recover (NR)
     0     1    Helped (HE)
     1     0    Hurt (HU)
     1     1    Always Recover (AR)

Of course, we can logically only ever observe one of the two outcomes Y(0) or Y(1): this is called the fundamental problem of causal inference.

Definition 1.2. We will define Y ≡ Y(X). This is sometimes formulated as the consistency assumption: X = x implies Y = Y(x).
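To make the fundamental problem concrete, here is a minimal simulation sketch (the distributions are invented for illustration): we generate both potential outcomes for each individual, but consistency means the data only ever reveal the coordinate picked out by X.

set.seed(1)
n <- 10
y0 <- rbinom(n, 1, 0.3)        # outcome if aspirin not taken
y1 <- rbinom(n, 1, 0.7)        # outcome if aspirin taken
x <- rbinom(n, 1, 0.5)         # treatment actually chosen
y <- ifelse(x == 1, y1, y0)    # consistency: Y = Y(X)
# the individual effects y1 - y0 exist, but each row reveals only y
data.frame(x, y)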

Figure 1: Graphs corresponding to (a) no confounding and (b) confounding by C.

1.1 Ignorability

The fundamental problem means that it is essentially impossible to identify individual causal effects. However, a lesser goal would be to consider E[Y(1) − Y(0)]. By consistency, if X = x then Y(x) = Y. Hence

    E[Y | X = x] = E[Y(x) | X = x] = E[Y(x)]   if Y(x) ⫫ X.

Then inference about Y(x) can be done just by looking at the observed values of Y such that X = x.

Definition 1.3. If Y(x) ⫫ X then the marginal distribution of Y(x) is identifiable (and is the same as that of Y | X = x). This assumption is called ignorability, or sometimes exchangeability.

Remark 1.4. In our example ignorability requires that Y(0) ⫫ X and Y(1) ⫫ X, but not the joint independence (Y(0), Y(1)) ⫫ X, sometimes called strong ignorability. The latter is, of course, untestable, since it involves assumptions about different worlds. Although it is hard to think of examples in which the marginal independences hold but the joint independence does not, it is good practice to avoid untestable assumptions wherever possible!

Ignorability might be interpreted as saying that the mechanism which assigns treatments is independent of the mechanism which turns those treatments into outcomes. Note that it does not imply Y ⫫ X: clearly Y = Y(X) depends upon X in general. The idea of separating inputs and mechanisms for causal inference has been exploited in other contexts (Janzing et al., 2012).

Ignorability is related, but not identical, to a lack of confounding. Confounding is, in essence, the presence of a common cause of X and Y, as represented by the node C in Figure 1(b). Any correlation between X and Y will then be a combination of the causal effect of X on Y (represented by the direct arrow) and spurious correlation due to the common cause. In the counterfactual world, this means that X is correlated with Y(x), as in Figure 2(b). When ignorability holds, as in Figure 1(a), splitting the node X into its two components shows that X ⫫ Y(x) (Richardson and Robins, 2013).

Figure 2: Single world intervention templates (SWITs) resulting from splitting the node X in the graphs in Figure 1.
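The role of ignorability can be seen in a small simulation (a sketch with made-up parameters): when X is randomized, the contrast of the observed groups estimates E[Y(1) − Y(0)], but when a common cause C drives both X and the potential outcomes, the same contrast is biased.

set.seed(2)
n <- 1e5
C <- rbinom(n, 1, 0.5)                  # a common cause
y0 <- rbinom(n, 1, 0.2 + 0.3 * C)       # potential outcomes both depend on C
y1 <- rbinom(n, 1, 0.4 + 0.3 * C)       # true ACE = E[Y(1) - Y(0)] = 0.2
contrast <- function(x) {
  y <- ifelse(x == 1, y1, y0)           # consistency
  mean(y[x == 1]) - mean(y[x == 0])
}
contrast(rbinom(n, 1, 0.5))             # randomized: close to 0.2
contrast(rbinom(n, 1, 0.1 + 0.8 * C))   # confounded by C: biased upwards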

Of course, the most common reason for ignorability to hold is that treatment is randomized, and the estimator given below is just the difference between the two treatment groups. If confounding is measured, then we may obtain conditional ignorability, which requires that X ⫫ Y(x) | C for each x. In this case inference can proceed more or less as in the unconfounded case. In some cases randomization may be conditional upon a covariate (over-sampling certain groups is common), or we may simply pick covariates which we believe are confounders in order to get sensible estimates.

1.2 Causal Effects

Definition 1.5. The average causal effect is defined as

    ACE_{X→Y} ≡ E[Y(1) − Y(0)].

Given i.i.d. observations (X_i, Y_i), i = 1, ..., n, and assuming ignorability, we can estimate the average causal effect by

    ÂCE_{X→Y} = (Σ_i X_i Y_i)/(Σ_i X_i) − (Σ_i (1 − X_i) Y_i)/(Σ_i (1 − X_i)).

This is essentially a trick to exploit the linearity of expectations. For a positive continuous outcome Y, a perfectly reasonable measure of the causal effect would be E[Y(1)/Y(0)], but there is no simple way to estimate this using ordinary data.

The null hypothesis of no population-level causal effect simply states that Y(0) and Y(1) have the same distribution, so the two potential outcomes are exchangeable. This means in particular that the average causal effect is zero. It does not imply that there is no individual-level causal effect; the hypothesis that Y_i(0) = Y_i(1) for each i is called the sharp null hypothesis.
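In code, the estimator is just a difference of sample means; a quick sanity check on simulated data (the parameters here are invented):

ace_hat <- function(x, y) {
  sum(x * y) / sum(x) - sum((1 - x) * y) / sum(1 - x)
}
set.seed(3)
x <- rbinom(1e4, 1, 0.5)             # randomized, so ignorability holds
y <- rbinom(1e4, 1, 0.3 + 0.2 * x)   # true ACE = 0.2
ace_hat(x, y)                        # should be close to 0.2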

1.3 Non-Compliance

Consider a randomized trial in which patients are assigned a treatment Z, which they may or may not then choose to follow. Let X be the treatment actually taken, and Y some outcome of interest. Since X is not randomly assigned, we cannot assume ignorability of Y given X: perhaps those most likely to take the treatment are also the healthiest patients.

Figure 3: Graph representing the non-compliance model with the exclusion restriction.

Example 1.6. The data in the table below come from a randomized clinical trial of a treatment for high cholesterol. A treatment Z was randomly assigned, and the treatment actually taken was recorded as X. An outcome measure based on reduced cholesterol levels was recorded as Y. The data are discussed in detail by Efron and Feldman (1991); the original continuous measurements were dichotomized by Pearl (2010).

    z  x  y  count     z  x  y  count
    0  0  0   158      1  0  0    52
    0  0  1    14      1  0  1    12
    0  1  0     0      1  1  0    23
    0  1  1     0      1  1  1    78

We are interested in the causal effect of the treatment taken (X) on the outcome (Y). We can obtain a log-odds ratio of 3.3 with 95% confidence interval (2.7, 3.9). This suggests a very strong effect, but it is, of course, just an association. We can instead consider the intention-to-treat effect of Z on Y, which gives 2.6 (2.0, 3.2). However, the conditions which influence the intention-to-treat effect are likely to be different in the world outside the trial.

To model the non-compliance, we allow X to be one of two potential outcomes X(z), which define an individual's compliance type:

    X(0)  X(1)  Type t_X
     0     0    Never Taker (NT)
     0     1    Complier (CO)
     1     0    Defier (DE)
     1     1    Always Taker (AT)

In general Y may be one of four potential outcomes Y(x, z), giving up to 16 different types. However, we will enforce the exclusion restriction: Y(x, z) = Y(x, z′) for all z, z′, so that the treatment assigned has no effect on the outcome other than through the treatment actually taken. This may or may not be realistic! We can then reduce the types for Y to the four usual ones. Define γ^x_i = E[Y(x) | t_X = i], i.e. the mean outcome for compliance type i when receiving treatment x.
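These figures can be reproduced directly from the table. The sketch below (the array layout and the Wald-type intervals are our own choices) computes the X-Y log-odds ratio and the intention-to-treat log-odds ratio:

counts <- array(c(158, 52, 0, 23, 14, 12, 0, 78), dim = c(2, 2, 2),
                dimnames = list(z = 0:1, x = 0:1, y = 0:1))
log_or <- function(tab) {   # Wald estimate and 95% CI for a 2x2 table
  est <- log(tab[1, 1] * tab[2, 2] / (tab[1, 2] * tab[2, 1]))
  se <- sqrt(sum(1 / tab))
  round(c(est = est, lo = est - 1.96 * se, hi = est + 1.96 * se), 1)
}
log_or(apply(counts, 2:3, sum))      # X-Y association: 3.3 (2.7, 3.9)
log_or(apply(counts, c(1, 3), sum))  # intention-to-treat: 2.6 (2.0, 3.2)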

We can define the average causal effect in this case as

    ACE_{X→Y} = Σ_i π_i (E[Y(1) | t_X = i] − E[Y(0) | t_X = i]) = Σ_i π_i (γ^1_i − γ^0_i).

It is not immediately clear whether or not this quantity is identifiable.

1.4 Identifiability and Inference

Definition 1.7. A parameter is identifiable if it is a function of the observable probability distribution. A parameter is semi- or partially identifiable if the range of values it can take is restricted for some values of the observable probability distribution. Otherwise we say it is unidentifiable.

The definition essentially states that something is identifiable if it can be determined precisely from a sufficiently large amount of data. Partial identifiability implies that we may be able to narrow down the range of possible values with data, but not pin the value down precisely. In the case of semi-identifiability there will be a range of possible values which are compatible with the observed probability distribution, so inference respecting the likelihood principle will not distinguish between them. Of course, we can assign a prior to the value of such a parameter, but inference will be very sensitive to the choice of prior within the range of compatible values (Richardson et al., 2011).

Denote by p_{X|Z}(x|z) ≡ P(X = x | Z = z) and p_{Y|XZ}(y|x, z) ≡ P(Y = y | X = x, Z = z) the observable conditional probability distributions. We will ignore the marginal distribution P(Z = z), since randomization means that this is chosen by design, and is independent of all the potential outcomes.

Example 1.8. To simplify matters, suppose that the treatment is not available to patients unless they are assigned to it, so that X(0) = 0. In other words, there are no Always Takers or Defiers. Then

    ACE_{X→Y} = π_CO (γ^1_CO − γ^0_CO) + π_NT (γ^1_NT − γ^0_NT).

The quantities π_CO and π_NT are identifiable, since

    π_CO = p_{X|Z}(1|1),   π_NT = 1 − π_CO.

In addition,

    γ^1_CO = p_{Y|XZ}(1|1, 1),   γ^0_NT = p_{Y|XZ}(1|0, 1),

since under our assumptions anyone who is observed to take the treatment is a Complier, and anyone who fails to take the treatment when assigned it is a Never Taker. Also,

    p_{XY|Z}(0, 1|0) = π_CO γ^0_CO + π_NT γ^0_NT,

so γ^0_CO can be recovered using the other information.
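As a numerical check, we can evaluate these identification formulas on the counts of Example 1.6, where indeed no-one takes the treatment when assigned control (a base-R sketch; the array layout is ours):

n <- array(c(158, 52, 0, 23, 14, 12, 0, 78), dim = c(2, 2, 2))  # n[z, x, y]
pi_co <- sum(n[2, 2, ]) / sum(n[2, , ])       # p_{X|Z}(1|1): about 0.612
pi_nt <- 1 - pi_co
gam1_co <- n[2, 2, 2] / sum(n[2, 2, ])        # p_{Y|XZ}(1|1,1): about 0.772
gam0_nt <- n[2, 1, 2] / sum(n[2, 1, ])        # p_{Y|XZ}(1|0,1): about 0.188
p01_0 <- n[1, 1, 2] / sum(n[1, , ])           # p_{XY|Z}(0,1|0)
gam0_co <- (p01_0 - pi_nt * gam0_nt) / pi_co  # recovered: about 0.014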

However, γ^1_NT is totally unidentifiable, since we never observe what happens to a Never Taker when they take the drug. Hence (exercise)

    ACE_{X→Y} = p_{XY|Z}(1, 1|1) − p_{XY|Z}(0, 1|0) + γ^1_NT p_{X|Z}(0|1),

where γ^1_NT ∈ [0, 1]. This gives (exercise)

    p_{XY|Z}(1, 1|1) − p_{XY|Z}(0, 1|0) ≤ ACE_{X→Y} ≤ 1 − p_{XY|Z}(0, 1|0) − p_{XY|Z}(1, 0|1).

The parameter ACE_{X→Y} is semi-identifiable: its value depends upon the unidentified γ^1_NT, but we can obtain non-trivial bounds. If one is willing to make assumptions about the effect of the treatment on Never Takers, then we can obtain tighter bounds still. This separation of causal quantities into completely identifiable and completely unidentifiable pieces is a sensible way of keeping track of our assumptions. A prior on an unidentified parameter will never be updated, regardless of how much data we collect.

1.5 Where logical positivists fear to tread

The advantage of a view which allows for counterfactuals is that one can define causal concepts which otherwise do not exist. Note that it is possible to state the assumption of no Always Takers and no Defiers by setting p_{XY|Z}(1, y|0) = 0 for y ∈ {0, 1}; this constraint therefore has an interpretation even without assuming the existence of well-defined counterfactual outcomes. By contrast, the complier average causal effect is defined as the effect of the treatment on individuals for whom (X(0), X(1)) = (0, 1). That is:

    ACE_CO ≡ γ^1_CO − γ^0_CO.

Unsurprisingly, this quantity does not have an interpretation without assuming the existence of well-defined compliance types. However, it is identifiable under the same assumptions as above. Such quantities may, under certain circumstances, be viewed as more relevant than an intention-to-treat effect for the whole population, or enable us to make inferences about what could be achieved by a future intervention. For example, how much benefit would there be in educating Never Takers about the advantages of the treatment?
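Before moving on, we can evaluate the bounds derived in Example 1.8 on the lipid data of Example 1.6, anticipating Example 2.1 below (a base-R sketch; the array layout is ours):

n <- array(c(158, 52, 0, 23, 14, 12, 0, 78), dim = c(2, 2, 2))  # n[z, x, y]
p11_1 <- n[2, 2, 2] / sum(n[2, , ])   # p_{XY|Z}(1,1|1)
p01_0 <- n[1, 1, 2] / sum(n[1, , ])   # p_{XY|Z}(0,1|0)
p10_1 <- n[2, 2, 1] / sum(n[2, , ])   # p_{XY|Z}(1,0|1)
c(lower = p11_1 - p01_0, upper = 1 - p01_0 - p10_1)  # about 0.391 and 0.779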

2 Probability Bounds

Suppose we have a binary joint probability distribution p_{xy} = P(X = x, Y = y) for x, y ∈ {0, 1}. If we observe the marginal distributions p_{x+} and p_{+y}, what can we deduce about the joint distribution? The answer comes in the form of the Fréchet bounds: for any events A, B, there exists a probability distribution with probabilities P(A), P(B), P(A ∩ B) if and only if

    max{0, P(A) + P(B) − 1} ≤ P(A ∩ B) ≤ min{P(A), P(B)}.

The result is almost trivial, but as an illustration we show how to prove this with algebraic variable elimination. (The package rporta used below is no longer on CRAN, but it can be installed manually from the archive.)

library(rporta)
M = rbind(cbind(diag(4), 0, 0, 0),
          c(1, 1, 1, 1, 0, 0, 1),
          c(1, 1, 0, 0, -1, 0, 0),
          c(1, 0, 1, 0, 0, -1, 0))
# first four columns represent joint probs, next two marginal
# probs, last column constants
M

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    1    0    0    0    0    0    0
## [2,]    0    1    0    0    0    0    0
## [3,]    0    0    1    0    0    0    0
## [4,]    0    0    0    1    0    0    0
## [5,]    1    1    1    1    0    0    1
## [6,]    1    1    0    0   -1    0    0
## [7,]    1    0    1    0    0   -1    0

# hence M[6,] means p{00} + p{10} - p{+0} (which we will set = 0)
sign = c(rep(1, 4), 0, 0, 0)
X = as.ieqfile(M, sign)
X

## DIM = 6
##
## INEQUALITIES_SECTION
## (1) 1x1 + 0x2 + 0x3 + 0x4 + 0x5 + 0x6 >= 0
## (2) 0x1 + 1x2 + 0x3 + 0x4 + 0x5 + 0x6 >= 0
## (3) 0x1 + 0x2 + 1x3 + 0x4 + 0x5 + 0x6 >= 0
## (4) 0x1 + 0x2 + 0x3 + 1x4 + 0x5 + 0x6 >= 0
## (5) 1x1 + 1x2 + 1x3 + 1x4 + 0x5 + 0x6 == 1
## (6) 1x1 + 1x2 + 0x3 + 0x4 - 1x5 + 0x6 == 0
## (7) 1x1 + 0x2 + 1x3 + 0x4 + 0x5 - 1x6 == 0
##
## END

# we will eliminate the joint probs except p{00}
X@elimination_order = c(0, 1, 2, 3, 0, 0)
fmel(X)

## DIM = 6
##
## INEQUALITIES_SECTION
## (1) -1x1 + 0x2 + 0x3 + 0x4 + 0x5 + 0x6 <= 0
## (2) 1x1 + 0x2 + 0x3 + 0x4 - 1x5 + 0x6 <= 0
## (3) 1x1 + 0x2 + 0x3 + 0x4 + 0x5 - 1x6 <= 0
## (4) -1x1 + 0x2 + 0x3 + 0x4 + 1x5 + 1x6 <= 1
##
## END

# one can check these are just the Frechet bounds
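Since rporta can be awkward to obtain, a self-contained base-R check of the same result is easy (a sketch; the grid search is ours): for fixed marginals we simply scan over candidate values of P(A ∩ B) and keep those for which all four cells of the joint are non-negative.

frechet <- function(pa, pb) c(lower = max(0, pa + pb - 1), upper = min(pa, pb))
pa <- 0.6; pb <- 0.7
pab <- seq(0, 1, by = 0.001)   # candidate values for P(A and B)
# all four cell probabilities of the implied joint must be non-negative
ok <- pab >= 0 & pab <= pa & pab <= pb & 1 - pa - pb + pab >= 0
range(pab[ok])                 # 0.3 to 0.6
frechet(pa, pb)                # the same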

The situation under the potential outcomes framework is very similar: in principle we can observe the marginal distributions of Y(0) and Y(1), but not their joint distribution. We cannot be certain of the proportion π_DE of people who are Defiers, for example. However, since p_{X|Z}(0|1) = π_DE + π_NT, we can deduce that π_DE ≤ p_{X|Z}(0|1), and therefore it is semi-identifiable.

Returning to the non-compliance setting, let π_ij = P(X(0) = i, X(1) = j) be the proportions of individuals of each compliance type. Then, for example,

    max{0, P(X(0) = 0) + P(X(1) = 1) − 1} ≤ π_01 ≤ min{P(X(0) = 0), P(X(1) = 1)}.

If we assume monotonicity (so that there are no Defiers), then

    π_11 = P(X(0) = 1),   π_00 = P(X(1) = 0),   π_01 = 1 − π_00 − π_11,

and the strata proportions magically become identifiable (recall that, under ignorability, P(X(z) = x) = P(X = x | Z = z)). This may be realistic in certain non-compliance settings, for example, and often enables identification.
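In the lipid trial of Example 1.6 no-one assigned control took the treatment, so monotonicity holds automatically and the strata proportions are identified (a base-R sketch; the array layout is ours):

n <- array(c(158, 52, 0, 23, 14, 12, 0, 78), dim = c(2, 2, 2))  # n[z, x, y]
pi11 <- sum(n[1, 2, ]) / sum(n[1, , ])  # P(X = 1 | Z = 0): Always Takers
pi00 <- sum(n[2, 1, ]) / sum(n[2, , ])  # P(X = 0 | Z = 1): Never Takers
c(AT = pi11, NT = pi00, CO = 1 - pi00 - pi11)  # 0, 0.388, 0.612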

2.1 Causal Effect Bounds

We can set up a much larger example to deal with the average causal effect in the non-compliance model.

M = matrix(0, 26, 26)
# define observed probabilities in terms of POs
M[1, c(2, 4, 6, 8)] = 1       # p(0,0|0)
M[2, c(11, 12, 15, 16)] = 1
M[3, c(1, 3, 5, 7)] = 1
M[4, c(9, 10, 13, 14)] = 1
M[5, c(6, 8, 14, 16)] = 1
M[6, c(3, 4, 11, 12)] = 1
M[7, c(5, 7, 13, 15)] = 1
M[8, c(1, 2, 9, 10)] = 1      # p(1,1|1)
M[1:8, 17:24] = -diag(8)
# set positive probabilities
M[9:24, 1:16] = diag(16)
M[, 26] = c(rep(0, 25), 1)
# define ACE
M[25, ] = c(rep(c(0, 1, -1, 0), 4), rep(0, 8), -1, 0)
# sum to 1
M[26, ] = c(rep(1, 16), rep(0, 8), 0, 1)

sign = c(rep(0, 8), rep(1, 16), 0, 0)
X = as.ieqfile(M, sign)
X@elimination_order = c(1:16, rep(0, 9))
out = fmel(X)
# inequalities involving the ACE
out@inequalities@num[-c(1:8, 11:12, 19, 20, 22, 26), 17:26]

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    0   -1   -1    0    0    0    0    0   -1     0
##  [2,]    0    0    0    0    0   -1   -1    0   -1     0
##  [3,]    0   -1   -1   -1    0    0    0    1   -1     0
##  [4,]    0    0    0    1    0   -1   -1   -1   -1     0
##  [5,]    0   -1   -1    1    0    0   -1   -1   -1     0
##  [6,]    0    0   -1   -1    0   -1   -1    1   -1     0
##  [7,]    0   -2   -2   -1    0    0    1    1   -1     0
##  [8,]    0    0    1    1    0   -2   -2   -1   -1     0
##  [9,]    0    0    0    0    0    1    1    0    1     1
## [10,]    0    0    1    0    0    1    0    0    1     1
## [11,]    0    1    0    0    0    0    1    0    1     1
## [12,]    0    1    1    0    0    0    0    0    1     1
## [13,]    0    0   -1   -1    0    1    2    0    1     1
## [14,]    0    1    2    0    0    0   -1   -1    1     1
## [15,]    0    0    1    1    0    2    1    0    1     2
## [16,]    0    2    1    0    0    0    1    1    1     2

We obtain 28 inequalities, of which 8 are just a consequence of working with probabilities and 4 are the instrumental inequalities, first derived by Pearl (1995). The remaining 16 (shown above) actually involve the ACE, and can be used to partially identify it; they were derived by Balke and Pearl (1997), improving on earlier bounds obtained separately by Manski and by Robins. Note that these bounds can also be obtained without using the counterfactual framework, since the average causal effect has an equivalent interpretation in (for example) Pearl's do-calculus framework (Cai et al., 2008).

If we choose to enforce additional constraints, such as the absence of Always Takers and Defiers, we can obtain tighter bounds.

nodefy = matrix(0, nrow = 8, ncol = ncol(M))
nodefy[, 9:16] = diag(8)
M2 = rbind(M, nodefy)
sign2 = c(sign, rep(0, 8))
X2 = as.ieqfile(M2, sign2)
X2@elimination_order = c(1:16, rep(0, 9))
out2 = fmel(X2)
out2@inequalities@num[, 17:26]

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1    1    1    1   -1   -1   -1   -1    0     0
##  [2,]    0    0    0    0    1    1    1    1    0     1
##  [3,]    0    0    0    1    0    0    0    0    0     0
##  [4,]    0    1    0    0    0    0    0    0    0     0
##  [5,]    0    0    0    0    0    0    0    0    0     0
##  [6,]    0    0    0    0    0    0    0    0    0     0
##  [7,]    0    0    0    0    0    0    0    0    0     0
##  [8,]    0    0    0    0    0    0    0    0    0     0
##  [9,]    0    0    0    0    0   -1    0    0    0     0
## [10,]    0    0    0    0    0    0   -1    0    0     0
## [11,]    0    0    0    0    0    0    0   -1    0     0
## [12,]    0    0   -1    0    0    0    1    0    0     0
## [13,]    0    0   -1    0    0    0    0    1   -1     0
## [14,]    0    0    1    0    0   -1   -1   -1    0     0
## [15,]    0    0    0    0    0    1    1    1    0     1
## [16,]    0    0    1    0    0    1    0    0    1     1

This means that

    p_{XY|Z}(1, 1|1) − p_{XY|Z}(0, 1|0) ≤ ACE_{X→Y} ≤ 1 − p_{XY|Z}(0, 1|0) − p_{XY|Z}(1, 0|1),

as we deduced in the previous section.

Example 2.1. Applying the bounds to our lipid data, we find

library(rje, warn.conflicts = FALSE)
dat = array(c(158, 0, 14, 0, 52, 23, 12, 78), c(2, 2, 2))
dat2 = condition.table(dat, 1:2, 3)
dat2   # p(x,y|z)

## , , 1
##
##        [,1]   [,2]
## [1,] 0.9186 0.0814
## [2,] 0.0000 0.0000
##
## , , 2
##
##        [,1]    [,2]
## [1,] 0.3152 0.07273
## [2,] 0.1394 0.47273

mat = out2@inequalities@num
bds = mat[, 17:24] %*% c(dat2) - mat[, 26]
bds[c(13, 16)]

## [1]  0.3913 -0.7792

So 0.391 ≤ ÂCE_{X→Y} ≤ 0.779. If we work using the likelihood principle, everything within these bounds is indistinguishable. Of course, it is perfectly possible to place priors within these bounds. The average causal effect for Compliers is

gamco1 = condition.table(dat, 2, c(1, 3), c(2, 2))[2]
gamnt0 = condition.table(dat, 2, c(1, 3), c(1, 2))[2]
pico = condition.table(dat, 1, 3, 2)[2]
pint = 1 - pico
gamco0 = (condition.table(dat, 1:2, 3)[1, 2, 1] - pint * gamnt0) / pico
gamco1 - gamco0

## [1] 0.7581

This is high relative to the interval of possible values for the overall average causal effect, because the Never Takers seem to do better than Compliers who are assigned the placebo (i.e. γ^0_CO < γ^0_NT).

Figure 4: Graph representing mediation.

3 Mediators

Sometimes we are not interested in the total causal effect of one quantity on another, but only in the effect along a particular pathway. This particularly arises when we move from asking "does it work?" to "how does it work?".

Example 3.1. Suppose we administer an estrogen treatment X and observe it to increase the incidence of cardiovascular disease (Y). What is the mechanism for the increased risk? One hypothesis is an increase in serum lipid concentrations (M). The proposed causal pathways are shown in Figure 4. Suppose we could administer a treatment which reduced the serum lipid concentrations even after estrogen therapy: how much effect would this have on cardiovascular disease rates? In other words, how much of the effect of estrogen on CVD is mediated through serum lipid concentrations, and how much is due to other factors?

Ideally, of course, we would like to decompose the effect of X on Y into a direct effect and an indirect effect through M. However, in general this is not possible, because there may be an interaction between the effects of X and M on Y. For example, suppose Y acts as an exclusive-or gate, so that Y = X + M mod 2. Even if we accept that the relation is causal, X's direct effect on Y is either positive or negative depending upon the state of M (see the sketch after Definition 3.2). This is an extreme example, and we might hope that combined effects are at least monotonic, and perhaps even approximately additive on some appropriate scale.

We begin by giving two possible definitions of a direct effect, and then fixing the indirect effect to be the residual difference.

Definition 3.2. The total effect of X on Y is

    E[Y(x)] − E[Y(0)] = E[Y(x, M(x))] − E[Y(0, M(0))].

The total direct effect is the ordinary average causal effect:

    TDE ≡ E[Y(x, M(x))] − E[Y(0, M(x))].

Alternatively, the natural direct effect is

    NDE ≡ E[Y(x, M(0))] − E[Y(0, M(0))],

i.e. keeping the mediator at its baseline level. Correspondingly, the natural and total indirect effects are given by the differences between the total effect and the relevant direct effect:

    NIE ≡ E[Y(x, M(x))] − E[Y(x, M(0))],
    TIE ≡ E[Y(0, M(x))] − E[Y(0, M(0))].
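Returning to the exclusive-or example above, a two-line check makes the interaction explicit (the function name is ours):

xor_y <- function(x, m) (x + m) %% 2   # Y determined by an XOR gate
xor_y(1, 0) - xor_y(0, 0)              # direct effect of X when M = 0: +1
xor_y(1, 1) - xor_y(0, 1)              # direct effect of X when M = 1: -1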

Figure 5: Graph representing mediation with confounding between the mediator and the outcome.

If ignorability holds between X and Y (i.e. X ⫫ Y(x, m)), then we can estimate the total causal effect of X on Y. In order to obtain cross-world quantities such as Y(x, M(x′)) where x′ ≠ x, we need more complex ignorability constraints. Suppose that X ⫫ M(x), Y(x, m) and M(x′) ⫫ Y(x, m) for each x, x′, m. This would hold if, for example, X and M were both randomized. By applying the same method as before twice, we find E[Y(x, m)] = P(Y = 1 | X = x, M = m). Then

    E[Y(x, M(x′))] = E[ E[ Y(x, M(x′)) | M(x′) ] ]
                   = Σ_m E[ Y(x, m) | M(x′) = m ] P(M(x′) = m)
                   = Σ_m E[ Y(x, m) ] P(M(x′) = m)
                   = Σ_m P(Y = 1 | X = x, M = m) P(M(x′) = m);

note that the third equality uses Y(x, m) ⫫ M(x′), where x′ is possibly different from x. This is a cross-world independence, and seems untestable even in principle. The strong ignorability assumption corresponding to mediation is that

    {M(x) : x ∈ 𝒳} ⫫ {Y(x, m) : x ∈ 𝒳, m ∈ ℳ},

a joint independence between all the potential outcomes. This is plausible in the presence of randomization of some sort, but in general it is quite likely that there is confounding between the two mechanisms which generate M and Y.
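The mediation formula displayed above is easy to compute once its two ingredients are known. A sketch with invented numbers (p_y holds P(Y = 1 | X = x, M = m) and p_m holds P(M(x) = m)); it also verifies that NDE + NIE equals the total effect:

p_y <- matrix(c(0.1, 0.3, 0.5, 0.7), 2, 2)  # p_y[x+1, m+1] = P(Y=1 | X=x, M=m)
p_m <- matrix(c(0.8, 0.4, 0.2, 0.6), 2, 2)  # p_m[x+1, m+1] = P(M(x) = m)
ey <- function(x, xp) sum(p_y[x + 1, ] * p_m[xp + 1, ])  # E[Y(x, M(x'))]
nde <- ey(1, 0) - ey(0, 0)         # natural direct effect: 0.20
nie <- ey(1, 1) - ey(1, 0)         # natural indirect effect: 0.16
c(nde, nie, ey(1, 1) - ey(0, 0))   # total effect 0.36 = NDE + NIE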

3.1 Intermediate Confounding

If there is unmeasured confounding between M(x) and Y(x, m), then we cannot identify the direct and indirect effects of X on Y. We can still, however, obtain semi-identification. Recall that the model in Figure 3 induces the instrumental inequalities on the observed distribution p_{XYZ}. We can imagine that there is no direct causal effect of X on Y whatsoever, and apply the instrumental inequalities to find that

    p_{MY|X}(m, y | 1 − x) + p_{MY|X}(m, 1 − y | x) ≤ 1,   for all m, x, y ∈ {0, 1}.

Hence, if these inequalities are violated, we may deduce that there is some direct effect of X on Y. More generally, we can derive bounds:

idx = array(1:64, rep(2, 6))
obj = rep(0, 64)
obj[c(idx[, 1, 2, , , ])] = 1
obj[c(idx[, 2, , , 2, ])] = obj[c(idx[, 2, , , 2, ])] + 1
obj[c(idx[, 1, 1, , , ])] = obj[c(idx[, 1, 1, , , ])] - 1
obj[c(idx[, 2, , , 1, ])] = obj[c(idx[, 2, , , 1, ])] - 1

trans = matrix(0, 8, 64)
trans[1, c(idx[1, , 1, , , ])] = 1
trans[2, c(idx[1, , 2, , , ])] = 1
trans[3, c(idx[2, , , , 1, ])] = 1
trans[4, c(idx[2, , , , 2, ])] = 1
trans[5, c(idx[, 1, , 1, , ])] = 1
trans[6, c(idx[, 1, , 2, , ])] = 1
trans[7, c(idx[, 2, , , , 1])] = 1
trans[8, c(idx[, 2, , , , 2])] = 1
trans = rbind(cbind(trans, -diag(8), 0), cbind(diag(64), matrix(0, 64, 9)))
trans = rbind(trans, c(rep(1, 64), rep(0, 9)))
trans = rbind(trans, c(obj, rep(0, 8), -1))
rhs = c(rep(0, 72), 1, 0)
eq = c(rep(0, 8), rep(1, 64), 0, 0)

X = as.ieqfile(cbind(trans, rhs), eq)
X@elimination_order = c(1:64, rep(0, 9))
out = fmel(X)
out@inequalities@num[, -c(1:64)]

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1    1    1    1   -1   -1   -1   -1    0     0
##  [2,]    0    0    0    0    1    1    1    1    0     1
##  [3,]    0   -2   -2   -2    0    0   -2   -2    1    -1
##  [4,]    0   -1    0    0    0    0    0    0    0     0
##  [5,]    0    0   -1    0    0    0    0    0    0     0
##  [6,]    0    0    0   -1    0    0    0    0    0     0
##  [7,]    0    0    0    0    0   -1    0    0    0     0
##  [8,]    0    0    0    0    0    0   -1    0    0     0
##  [9,]    0    0    0    0    0    0    0   -1    0     0
## [10,]    0    0    0    0    0    0    0    0    1     1
## [11,]    0    0    0    0    0    1    1    1    0     1
## [12,]    0    1    1    1    0    0    0    0    0     1
## [13,]    0    0    0    0    0    0    0    0   -1     1
## [14,]    0    2    0    0    0    0   -2   -2   -1     1
## [15,]    0    0    2    0    0    0    2    2    1     3
## [16,]    0    0    0    2    0    0    2    2   -1     3

This leads to six inequalities involving the average causal effect (written in the notation of Figure 3):

    max{ −3 + 2p_{X|Z}(1|1) + 2p_{XY|Z}(1, 0|0),
         −3 + 2p_{X|Z}(0|1) + 2p_{XY|Z}(0, 0|0),
         −1 }
      ≤ ACE_{X→Y} ≤
    min{ 1 + 2p_{X|Z}(1|1) − 2p_{XY|Z}(0, 1|0),
         1 + 2p_{X|Z}(0|1) − 2p_{XY|Z}(1, 1|0),
         1 }.

Looking at the bounds, unsurprisingly one often finds that the inequalities are trivial and just tell us that the average causal effect is between −1 and 1. Of course, if we are willing to add in additional assumptions, we will obtain tighter bounds.

M = out@inequalities@num
M = M[M[, 73] != 0, ]
p = c(0.05, 0.05, 0.8, 0.1, 0.05, 0.05, 0.1, 0.8)
# gives bounds of 0.4 to 1
cbind(M[, 65:72] %*% p - M[, 74], M[, 73])

##      [,1] [,2]
## [1,] -2.7    1
## [2,] -1.0    1
## [3,] -1.0   -1
## [4,] -2.7   -1
## [5,]  0.4    1
## [6,] -1.0   -1

3.2 Principal Stratification

If M(0) = M(1) = m, then it is clear that the total effect is equal to the natural and total direct effects. In this case there is little cause to doubt that Y(1, m) − Y(0, m) is the sensible measure of the direct (and total) effect of X on Y. This leads to the idea of stratifying the population based upon the values of (M_i(0), M_i(1)), in other words the compliance type. For Always Takers (m = 1) and Never Takers (m = 0), the direct effect may sensibly be defined as

    PSDE(m) = E[Y(1, m) − Y(0, m) | M(0) = M(1) = m].

This is the Principal Stratum Direct Effect (Frangakis and Rubin, 2002). One may posit that there is a sensible relationship between these causal effects and the direct effects for the other compliance types. It is worth noting that principal stratification is one of only a few approaches which has essentially no interpretation outside the potential outcomes framework.

Estimation of the PSDE is somewhat non-trivial, however, since there is no way to ascertain which individuals are of which compliance type. Ignorability together with monotonicity of compliance (i.e. no Defiers) is sufficient for identification. Otherwise, typical approaches are Bayesian and treat the unobserved potential outcomes as missing data. There are pitfalls with this approach, as noted above. It is not obvious how principal stratification can be extended to cases where the mediator has many categories or is continuous.

3.3 Parametric Relationships

For dealing with continuous outcomes (or discrete outcomes with a larger state space) it is necessary to impose parametric relationships. Robust methods exist to ensure that model misspecification does not lead to incorrectly inferring a causal effect where none exists. See, for example, Richardson et al. (2011) and Hernán and Robins (2014).

References

A. Balke and J. Pearl. Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, 92(439):1171-1176, 1997.

Z. Cai, M. Kuroki, J. Pearl, and J. Tian. Bounds on direct effects in the presence of confounded intermediate variables. Biometrics, 64(3):695-701, 2008.

B. Efron and D. Feldman. Compliance as an explanatory variable in clinical trials. Journal of the American Statistical Association, 86(413):9-17, 1991.

C. E. Frangakis and D. B. Rubin. Principal stratification in causal inference. Biometrics, 58(1):21-29, 2002.

M. A. Hernán and J. M. Robins. Causal Inference. Chapman & Hall/CRC, 2014. See www.hsph.harvard.edu/miguel-hernan/causal-inference-book/.

D. Janzing, J. Mooij, K. Zhang, J. Lemeire, J. Zscheischler, P. Daniušis, B. Steudel, and B. Schölkopf. Information-geometric approach to inferring causal directions. Artificial Intelligence, 182:1-31, 2012.

J. Neyman. On the application of probability theory to agricultural experiments: essay on principles. 1923. Translated in Statistical Science, 5(4):465-472, 1990.

J. Pearl. On the testability of causal models with latent and instrumental variables. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI-95), pages 435-443. Morgan Kaufmann Publishers Inc., 1995.

J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, second edition, 2010.

T. S. Richardson and J. M. Robins. Single world intervention graphs (SWIGs): a unification of the counterfactual and graphical approaches to causality. Technical Report 128, CSSS, University of Washington, 2013.

T. S. Richardson, R. J. Evans, and J. M. Robins. Transparent parameterizations of models for potential outcomes. Bayesian Statistics, 9:569-610, 2011.

D. B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.