Matching to estimate the causal effects from multiple treatments


Michael J. Lopez and Roee Gutman
Department of Biostatistics, Box G-S121-7, 121 S. Main Street, Providence, RI

Abstract: The propensity score is a common tool for estimating the causal effect of a binary treatment using observational data. In this setting, matched methods, defined as either individual matching, subclassifying, or using inverse probability weighting on the propensity score, can reduce the initial covariate bias between the treatment and control groups. With more than two treatment options, however, matched methods require additional assumptions and techniques, the implementations of which have varied across disciplines, including public health and economics. This paper blends current methods together, identifying and contrasting the treatment effects each one estimates. Further, we propose a matching technique for use with multiple, nominal categorical treatments, and use simulations to show improved covariate similarity between those in the matched sets compared both to the pre-matched cohort and to other current matching strategies. In summary, this manuscript provides a central location for those looking to notate and use causal methods for categorical treatments.

Keywords and phrases: causal inference, propensity score, multiple treatments, matching, observational data.

1. Introduction

The primary goal of many scientific applications is to identify causes and effects. While clinical trials are the gold standard for estimating a causal relationship, trials for certain exposures may be implausible due to logistical, ethical, or financial issues. Further, clinical trials may not be as generalizable as observational studies due to the restricted population used in the trials.
When assignment to treatment is not randomized, as in observational data, those that receive one level of the treatment may differ from those that receive another with respect to variables that also influence the outcome. In such settings, establishing causes and effects requires additional assumptions and more sophisticated statistical tools.

Matched methods using propensity scores are popular for estimating causal effects from observational studies with binary treatment (Rosenbaum and Rubin, 1983; Stuart, 2010). The propensity score is defined as the probability of receiving treatment conditional on a set of observed covariates. When treatment assignment is not confounded by unmeasured variables, individuals with the same propensity score can be considered subjects of a randomized trial, and differences in their outcomes provide unbiased unit-level estimates of the treatment's causal effect (Rosenbaum and Rubin, 1983). As such, the propensity score can be used with observational data to replicate the results which would have occurred in a clinical trial.

Generalizations and applications of propensity score methods for multiple treatments, however, remain scattered in the literature, in large part because the advanced techniques required are unfamiliar and inaccessible. Our first aim is to coalesce existing propensity score methods previously suggested for multiple treatments, with a specific focus on methods for nominal categorical ones (e.g., a comparison of pain-killers Motrin, Advil, and Tylenol). We contrast these methods' required assumptions and define the causal effects they each attempt to estimate. In doing so, potential pitfalls in the commonly used practice of applying binary propensity score tools to multiple treatments are identified.
Further, we explain the elevated importance of defining a common support region when studying multiple treatments, where minor differences in the execution of certain approaches can vary the causal estimands and change the study population to which inference is generalizable. We provide a technique for generating matched sets when there are more than two treatments, supplementing our algorithm with extensive simulations to demonstrate the resulting similarity in the covariate distributions between those matched. In short, we demonstrate both the proper specification of causal effects for multiple treatments and the implementation of a matching algorithm to estimate these effects.

In the remainder of Section 1, we introduce our notation and required assumptions. Section 2 explores existing causal methods for multiple treatments, and Section 3 proposes a new algorithm for matching with multiple treatments. Section 4 uses simulations to contrast matched approaches, while Section 5 concludes.

The setting

Randomized designs are attractive for estimating causal effects because proper randomization ensures that the distributions of both observed and unobserved covariates among each treatment group are equivalent in expectation. Data from observational studies, meanwhile, frequently suffer from treatment selection bias, where those who receive one treatment differ from those that receive another. For example, in a study estimating the causal effects of neighborhood choice on employment, persons who move into deprived neighborhoods differ from those who move into privileged ones on a variety of characteristics which also affect the outcome, such as socioeconomic status and education levels (Hedman and Van Ham, 2012). As such, it may be difficult to distinguish between neighborhood effects and the differences between subjects which existed before they chose their neighborhoods.
MJ Lopez was supported by the National Institute for Health (IMSD Grant #R25GM083270) and the National Institute on Aging (NIA grant #F31AG046056).

Lopez & Gutman/Matching with multiple treatments

Matched methods attempt to replicate the covariate balance which would have occurred had treatment assignment been randomized, where no systematic covariate differences between those receiving each treatment exist. This is done by selecting subsets of the study population which are balanced on measured covariates. Balance refers to the distribution of each covariate and all covariate interactions; the matched set chosen should be the one in which subjects at different treatments are as similar as possible, with differences in covariates between those at different treatments no more than the differences which would have occurred in a randomized experiment (Stuart and Rubin, 2008). By ensuring that those receiving different treatments are similar, matching works to eliminate the effects of treatment selection bias on causal estimates.

A commonly used option to account for observed differences in the background variables between those at different treatment levels is regression adjustment, in which functions of these observed covariates are used in a model of outcome on treatment. Matching using propensity scores can provide several advantages over standard regression adjustment. First, regression adjustment is both biased and requires extrapolation when there is insufficient overlap between those at different treatments (Dehejia and Wahba, 1998, 2002). Further, causal effects from matched methods do not require the formulation of an outcome model, a critical advantage because model misspecification can lead to biased causal effects when using regression adjustment (Rubin, 1973a, 1979; Ho et al., 2007).
Lastly, in many scenarios, matching methods should be used to supplement regression adjustment; the combination of subclassification with regression adjustment, for example, has been shown to work better than either method alone (Cochran and Rubin, 1973; Rubin, 1979; Lunceford and Davidian, 2004; Hade and Lu, 2013; Lopez and Gutman, 2013).

Notation for binary treatment

Our notation stems from the framework of potential outcomes and the Rubin Causal Model (RCM) (Rubin, 1975; Holland, 1986). Let $Y_i$, $X_i$, and $T_i \in \mathcal{T}$ be the observed outcome, set of covariates, and binary treatment assignment, respectively, for each subject $i = 1, \ldots, N$, with $N < \mathcal{N}$, where $\mathcal{N}$ is the population size, which is possibly infinite. For $\mathcal{T} = \{t_1, t_2\}$, let $n_1$ and $n_2$ be the number of subjects receiving treatments $t_1$ and $t_2$, respectively.

The RCM defines $Y_i(T_i = t) = Y_i(t)$ as the potential outcome for subject $i$ if the subject were to have received treatment $t$. Two estimands of interest are the population average treatment effect, $PATE_{t_1,t_2}$, and the population average treatment effect among those treated, $PATT_{t_1,t_2}$.

$$PATE_{t_1,t_2} = E[Y(t_1) - Y(t_2)] \quad (1)$$

$$PATT_{t_1,t_2} = E[Y(t_1) - Y(t_2) \mid T = t_1] \quad (2)$$

Letting $I(T_i = t_1)$ be the indicator function for an individual receiving treatment $t_1$, $PATE_{t_1,t_2}$ and $PATT_{t_1,t_2}$ are generally estimated by using sample average treatment effects, $SATE_{t_1,t_2}$ and $SATT_{t_1,t_2}$, which can be estimated using the data.

$$SATE_{t_1,t_2} = \frac{1}{N} \sum_{i=1}^{N} \left(Y_i(t_1) - Y_i(t_2)\right) \quad (3)$$

$$SATT_{t_1,t_2} = \frac{1}{n_1} \sum_{i=1}^{N} \left(Y_i(t_1) - Y_i(t_2)\right) I(T_i = t_1) \quad (4)$$

Because each individual receives only one treatment, only $Y_i(t_1)$ or $Y_i(t_2)$ is observed for each subject, an issue known as the fundamental problem of causal inference (Holland, 1986). As a result, two assumptions are needed to estimate (3) and (4) using the RCM.
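To make Equations (3) and (4) concrete, the following is a minimal sketch on a toy table in which both potential outcomes are written down for every subject. This is an assumption made purely for illustration, since in real data only one potential outcome per subject is observed. The data values and the function names `sate` and `satt` are hypothetical and do not come from the paper.

```python
# Toy illustration of Equations (3) and (4). All numbers are hypothetical.
# Both potential outcomes are listed for each subject so that the sample
# estimands can be computed directly; real data reveal only one of them.

subjects = [
    # (Y_i(t1), Y_i(t2), treatment actually received)
    (5.0, 3.0, "t1"),
    (4.0, 4.0, "t1"),
    (6.0, 2.0, "t2"),
    (3.0, 1.0, "t2"),
]


def sate(rows):
    """Sample average treatment effect of t1 versus t2, Equation (3)."""
    return sum(y1 - y2 for y1, y2, _ in rows) / len(rows)


def satt(rows, reference="t1"):
    """Sample average effect among subjects who received `reference`, Equation (4)."""
    treated = [(y1, y2) for y1, y2, t in rows if t == reference]
    return sum(y1 - y2 for y1, y2 in treated) / len(treated)


sate_value = sate(subjects)  # averages over all four subjects
satt_value = satt(subjects)  # averages over the two subjects on t1
```

In this toy table the two quantities differ because the unit-level effects among those who received $t_1$ are not representative of the full sample, which is why the choice of estimand matters.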
First, the Stable Unit Treatment Value Assumption (SUTVA) specifies no interference between subjects and no hidden treatment versions, entailing that the set of potential outcomes for each subject is independent of the treatment assignment of others. A second assumption is that of strong unconfoundedness, under which contrasting outcomes for subjects with the same $X$ are unbiased estimates of the treatment's causal effect at that $X$. Specifically, strong unconfoundedness implies (i) $P(T = t \mid \{Y(t_1), Y(t_2)\}, X) = P(T = t \mid X)$ and (ii) $0 < P(T = t \mid X) < 1\ \forall t \in \mathcal{T}$ (Rosenbaum and Rubin, 1983). Weaker versions of unconfoundedness are sufficient for some estimation techniques (Imbens, 2000), and are explored in a later section.

As the dimensions of $X$ increase, matching on $X$ can be either computationally intensive or impossible, and propensity scores allow inference under the RCM even in a high dimensional setting. Let $e_{t_1,t_2}(X) = P(T = t_1 \mid X)$ be the propensity score (PS), with $\hat{e}_{t_1,t_2}(X)$ the estimated PS, traditionally calculated using logistic or probit regression. If treatment assignment is strongly unconfounded given $X$, then it is possible to estimate unbiased unit-level causal effects between those at different treatment assignments with equal PSs (Rosenbaum and Rubin, 1983). Propensity scores are often used for one-to-one or one-to-many matching (generally to estimate (4)) or via inverse probability weighting or subclassification (to estimate (3)); the choice among these should lie with the investigators (Stuart, 2010).

Matching with binomial treatment

Before looking at a multiple treatment setting, we first differentiate how estimation of the $PATT_{t_1,t_2}$ for a binomial treatment is driven by the study population which is matched. Figure 1 shows different sets of covariate overlap between those receiving $t_1$ and

$t_2$.

Fig 1: Three scenarios of covariate overlap for binary treatment (Scenarios 1a, 1b, and 1c): shaded areas represent subjects included in a matched analysis.

Each circle in Figure 1 represents a hypothetical distribution of $X$ among those on each treatment, allowing for an infinitesimally small number of subjects outside of it. For example, each circle could represent the 99th percentile of a two-dimensional multivariate normal distribution. In Figure 1, shaded regions correspond to the subjects receiving $t_1$ who were matched to a subject receiving $t_2$. When all subjects receiving $t_1$ have at least one subject to be matched to among those receiving $t_2$, the population matched reflects the covariate distribution of those receiving treatment $t_1$, making $PATT_{t_1,t_2}$ generalizable to the population of those treated on $t_1$ (Scenario 1a).

In Scenario 1b, $PATT_{t_1,t_2}$ also intends to reflect the population receiving $t_1$. However, because subjects receiving $t_1$ without obvious matches are also included in the analysis, there may be non-random differences in the covariates among those matched, yielding biased causal effects. Moreover, it is impossible to approximate this estimand without making unassailable assumptions with respect to those falling outside the common support.

A common step to handle these issues is to eliminate extreme propensity scores using a common support region, where those with either $X$ or $\hat{e}_{t_1,t_2}(X)$ beyond the range of $X$ or $\hat{e}_{t_1,t_2}(X)$ of those on the other treatment are excluded from the analysis phase (Dehejia and Wahba, 1998). It is also possible to reduce differences between the subjects matched by using a caliper in the matching procedure, where subjects are dropped if no eligible matches have similar $\hat{e}_{t_1,t_2}(X)$ (Caliendo and Kopeinig, 2008; Stuart, 2010).
With either caliper matching or use of a common support region, the treatment effect generalizes to those receiving $t_1$ who were eligible to be treated with treatment $t_2$ (Figure 1, Scenario 1c). Let $E_1$ be an indicator for having a propensity score within the common support, such that

$$E_1 = \begin{cases} 1 & \text{if } e_{t_1,t_2}(X) \in e_{t_1,t_2}(X, T = t_1) \cap e_{t_1,t_2}(X, T = t_2) \\ 0 & \text{otherwise} \end{cases}$$

Here, the specific estimand of interest changes to $PATT_{E_1(t_1,t_2)} = E[Y(t_1) - Y(t_2) \mid T = t_1, E_1 = 1]$, the average treatment effect among those receiving $t_1$ who were eligible to have received $t_2$, with eligibility, $E_1$, defined by having a propensity score within the common support region.

Notation for multiple treatments

The choices of population subsets to which inference is generalizable grow with increasing treatment choice. Let $\mathcal{T} = \{t_1, t_2, \ldots, t_Z\}$ be the treatment support for $Z$ total treatments, with $\mathcal{Y}_i = \{Y_i(t_1), Y_i(t_2), \ldots, Y_i(t_Z)\}$ the set of potential outcomes for subject $i$, and let PATEs and PATTs refer to the class of estimands attempting to replicate effects (1) and (2) in the multiple treatment setting. There are $\binom{Z}{2}$ possible PATEs of interest, one for every pairwise comparison of treatments, such that

$$PATE_{t_1,t_2} = E[Y(t_1) - Y(t_2)] \quad \forall \{t_1, t_2\} \subset \mathcal{T},\ t_1 \neq t_2 \quad (5)$$

PATEs for multiple treatments are transitive, such that $PATE_{t_1,t_2} + PATE_{t_2,t_3} = PATE_{t_1,t_3}$.

To contrast two treatments among subsets of the population which received one of the $Z$ treatments, formulation of PATTs for multiple treatments requires extended notation. For example, let $PATT_{t_1|t_1,t_2}$ be the population average treatment effect of $t_1$ versus $t_2$ among those receiving $t_1$. With a reference treatment group $t_1$, there would traditionally be $Z - 1$ PATTs to estimate, one for each of the treatments which the reference group did not receive (McCaffrey et al., 2013).
$$PATT_{t_1|t_1,t_2} = E[Y(t_1) - Y(t_2) \mid T = t_1] \quad (6)$$

$$PATT_{t_1|t_1,t_3} = E[Y(t_1) - Y(t_3) \mid T = t_1] \quad (7)$$

$$\vdots$$

$$PATT_{t_1|t_1,t_Z} = E[Y(t_1) - Y(t_Z) \mid T = t_1] \quad (8)$$

Conditional on receiving one specific treatment, PATTs are transitive, such that $PATT_{t_1|t_1,t_2} + PATT_{t_1|t_2,t_3} = PATT_{t_1|t_1,t_3}$, but this property generally does not extend when conditioning upon receiving a different treatment. For example, unless the covariate distributions of those receiving treatments $t_1$ and $t_2$ are identical, $PATT_{t_1|t_1,t_2} + PATT_{t_2|t_2,t_3} \neq PATT_{t_1|t_1,t_3}$.

To estimate PATEs and PATTs across exposure pairs, assumptions are expanded as follows. First, the SUTVA expands across a subject's vector of potential outcomes. Second, a strongly ignorable treatment assignment for multiple exposures entails (i) $\Pr[\mathcal{Y} \mid T = t, X] = \Pr[\mathcal{Y} \mid X]$, where $\mathcal{Y}$ is the set of potential outcomes, and (ii) $0 < \Pr[T = t \mid X]\ \forall t \in \mathcal{T}$ (Imai and Van Dyk, 2004).

The generalized propensity score

The generalized propensity score (GPS), $r(t, X) = \Pr(T = t \mid X = x)$, extends the PS from a binary to the multiple treatment setting (Imbens, 2000; Imai and Van Dyk, 2004). Unlike the PS, which allows for matching, subclassification, or weighting on a scalar, conditioning with multiple treatments often must be done on a vector of GPSs, $R(X) = (r(t_1, X), \ldots, r(t_Z, X))$, or a function thereof (Imai and Van Dyk, 2004). Specifically, with multiple treatments, two individuals with the same $r(t, X)$ for treatment $t$ may have differing $R(X)$s. For example, for $Z = 3$ and $\mathcal{T} = \{t_1, t_2, t_3\}$, let $R(X_i)$, $R(X_j)$, and $R(X_k)$ be the GPS vectors for subjects $i$, $j$, and $k$, respectively, where $T_i = t_1$, $T_j = t_2$, and $T_k = t_3$, with

$R(X_i) = (0.30, 0.60, 0.10)$
$R(X_j) = (0.30, 0.35, 0.35)$
$R(X_k) = (0.30, 0.10, 0.60)$

Even though subjects $i$, $j$, and $k$ have the same $r(t_1, X) = 0.30$, because these subjects' $r(t_2, X)$ and $r(t_3, X)$ differ, differences in outcomes between these subjects would generally not provide unbiased causal effect estimates (Imbens, 2000).
In part due to this limitation, Imbens (2000) called individual matching less well-suited to multiple treatment settings. Only under the scenario of $R(X_i) = R(X_j) = R(X_k)$ would contrasts in the outcomes of subjects $i$, $j$, and $k$ provide unbiased unit-level estimates of the causal effects between treatments (Imbens, 2000; Imai and Van Dyk, 2004). If this conditioning on equivalent GPS vectors were possible, however, and one set were available for each subject receiving treatment $t_1$, these comparisons could be aggregated across matched sets to estimate PATTs, generalizable to those receiving $t_1$. As a result, grouping subjects with similar GPS vectors is the primary goal in estimating causal effects from multiple treatments.

GPS vectors are often estimated using the multinomial logistic, multinomial probit, and nested logit models for nominal categorical treatments, with the proportional odds model recommended with ordinal treatments (Tchernis, Horvitz-Lennon and Normand, 2005). An important point which deserves mention is an assumption made for the multinomial logistic model, called independence from irrelevant alternatives (IIA). IIA states that the ratio of any two predicted probabilities does not depend on changes in the remaining $Z - 2$ treatments. For example, in a setting with two conventional and two atypical antipsychotic drugs, physicians often first choose drug type (conventional or atypical) before choosing an exact prescription (Tchernis, Horvitz-Lennon and Normand, 2005). If, for example, one of the two atypical drugs were to be taken off the market, the only treatment assignment probability which might change would be that of the atypical drug remaining on the market.
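The IIA property of the multinomial logistic model can be checked numerically. The sketch below is illustrative only: the linear predictors in `eta`, the drug labels, and the function name `multinomial_logit_probs` are hypothetical. It shows that removing one alternative leaves the ratio of the two conventional-drug probabilities unchanged, which is the behaviour the antipsychotic example calls into question.

```python
import math


def multinomial_logit_probs(linear_predictors):
    """Multinomial logistic probabilities from {treatment: linear predictor}."""
    denom = sum(math.exp(v) for v in linear_predictors.values())
    return {t: math.exp(v) / denom for t, v in linear_predictors.items()}


# Hypothetical linear predictors for a single subject (not from the paper).
eta = {"conv_A": 0.2, "conv_B": -0.1, "atyp_A": 0.5, "atyp_B": 0.4}

# Probabilities with all four drugs available, and after removing atyp_B.
full = multinomial_logit_probs(eta)
reduced = multinomial_logit_probs({t: v for t, v in eta.items() if t != "atyp_B"})

# Under the multinomial logit, this ratio depends only on the two linear
# predictors involved, so it is the same in both choice sets.
ratio_full = full["conv_A"] / full["conv_B"]
ratio_reduced = reduced["conv_A"] / reduced["conv_B"]
```

Note that under this model every remaining drug's probability rises proportionally once `atyp_B` is removed, rather than only that of the remaining atypical drug.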
In these settings, the IIA assumption would be violated, in which case a nested logit or a multinomial probit treatment assignment model, ones allowing for flexible levels of pair-wise correlations, would be preferred to a multinomial logistic one (Tchernis, Horvitz-Lennon and Normand, 2005).

In the sections that follow, we explain how matching methods have been implemented for both ordinal and nominal categorical treatments. One strategy we do not explore is covariate adjustment for the GPS or a function of the GPS in a regression model (Filardo et al., 2009, 2007; Dearing, McCartney and Taylor, 2009; Spreeuwenberg et al., 2010). Such techniques are subject to the same model misspecification and extrapolation problems as standard regression adjustment, and simulations have shown that these strategies perform worse than matching, stratification, or weighting (Hade and Lu, 2013).

Ordinal treatments

With ordinal treatments, such as ones measured on a scale (e.g., never - sometimes - always) or received as a dose (e.g., low - medium - high), it is possible to condition on a scalar balancing score in place of a GPS vector (Joffe and Rosenbaum, 1999; Imai and Van Dyk, 2004). This is done by estimating treatment assignment as a function of $X$ using the proportional odds model (McCullagh, 1980), such that

$$\log\left(\frac{P(T_i < t)}{P(T_i \geq t)}\right) = \theta_t - \beta^T X_i, \quad t = 1, \ldots, Z - 1 \quad (9)$$

Letting $\beta = (\beta_1, \ldots, \beta_p)^T$, Imai and Van Dyk (2004) showed that differences in outcomes between subjects with different exposure levels but equal $\beta^T X$ scores can provide unbiased unit-level estimates of causal effects at that $\beta^T X$. The balancing property of $\beta^T X$ can be used to match or subclassify subjects receiving different levels of an ordinal exposure.

Lu et al. (2001) used non-bipartite matching to form matched sets based on a function of $\beta^T X$ and the relative distance between

exposure levels. While this method does not specify an exact causal estimand, it is popular for generating an overall sense of whether or not a dose-response relationship exists between $T$ and $Y$. A September 2013 Google Scholar search for Lu et al. (2001) yielded 115 citations (including, among others, Armstrong, Jagolinzer and Larcker (2010); Frank, Akresh and Lu (2010); Snodgrass et al. (2011)).

Imai and Van Dyk (2004), Zanutto, Lu and Hornik (2005), Yanovitzky, Zanutto and Hornik (2005), and Lopez and Gutman (2013) used Equation (9) to estimate treatment assignment and subclassify subjects with similar $\beta^T X$ values. Subclassification is attractive with ordinal exposures because individuals with roughly equivalent $\beta^T X$ estimates are likely to have similar distributions of $X$. After subclassifying, unbiased causal effects can be estimated within each subclass, and aggregated across subclasses using a weighted average to estimate either PATEs or PATTs (Zanutto, Lu and Hornik, 2005). Lopez and Gutman (2013) recommended that regression adjustment be used within each subclass when possible, with results likewise combined to estimate causal effects.

A final strategy for estimating the causal effects of ordinal exposures is to dichotomize the treatment using a pre-specified cutoff for use with binary propensity score methods (Chertow, Normand and McNeil, 2004; Davidson et al., 2006; Schneeweiss et al., 2007). However, this procedure may result in a loss of information, as all subjects on one side of the cutoff are treated as having the same exposure level. Royston, Altman and Sauerbrei (2006) identified a loss of power, residual confounding of the treatment assignment mechanism, and possible bias in estimates as the result of dichotomization. Moreover, dichotomization makes identification of an optimal exposure level impossible.
As a result, matching or subclassification methods which maintain all exposure levels while balancing on $\beta^T X$ are preferred for causal inference with ordinal exposures (Imai and Van Dyk, 2004).

Nominal treatments

Matching on a scalar is feasible for either a binary or ordinal treatment assignment, using the propensity score and $\beta^T X$ from Equation (9), respectively. Such shortcuts, however, do not exist for nominal categorical treatments, where one must condition on two or more elements of a GPS vector to estimate causal effects (Imbens, 2000; Imai and Van Dyk, 2004). Because it is difficult to calculate matched sets with nominal treatments, researchers have turned to the application of binary matching tools to handle these contrasts. Two approaches, a series of binomial comparisons (Lechner, 2001) and common referent matching (Rassen et al., 2011), are discussed next. Following a description of each, we provide a simple example of how applying a binary PS approach to multiple exposures could fail to match balanced subjects.

Series of binomial comparisons

Lechner (2001, 2002) estimated PATTs between multiple treatments using a series of binary comparisons (SBC). With $Z$ treatments, SBC implements binary propensity score methods within each of the $\binom{Z}{2}$ pairwise population subsets. For example, a treatment effect comparing $t_1$ to $t_2$ uses only subjects receiving either $t_1$ or $t_2$, with matching done on a scalar PS. Lechner advocates matching on either $e_{t_1,t_2}(X)$, estimated using logistic or probit regression, or $r(t_1, X)/(r(t_1, X) + r(t_2, X))$, where $r(t_1, X)$ and $r(t_2, X)$ are estimated using a multinomial model. Figure 2 (Scenario 2a) depicts the unique common support regions for $Z = 3$ when using SBC, where treatment effects reflect different subsets of the population.
For example, let $E_2(t_1, t_2)$ be the indicator for having a binary propensity score for treatments $t_1$ and $t_2$ within a common support:

$$E_2(t_1, t_2) = \begin{cases} 1 & \text{if } e_{t_1,t_2}(X) \in e_{t_1,t_2}(X, T = t_1) \cap e_{t_1,t_2}(X, T = t_2) \\ 0 & \text{otherwise} \end{cases}$$

SBC estimates the causal effect of treatment $t_1$ versus treatment $t_2$, among those on $t_1$, as $PATT_{E_2(t_1|t_1,t_2)}$, where

$$PATT_{E_2(t_1|t_1,t_2)} = E[Y(t_1) - Y(t_2) \mid T = t_1, E_2(t_1, t_2) = 1] \quad (10)$$

Each pairwise treatment effect from SBC generalizes only to subjects eligible for that specific pair of treatments, as opposed to those eligible for all treatments. Such pairwise treatment effects are not transitive. For example, $PATT_{E_2(t_1|t_1,t_2)}$ and $PATT_{E_2(t_1|t_1,t_3)}$ are unlikely to provide useful comparisons of treatments $t_2$ and $t_3$ because each effect generalizes to separate subsets of subjects who received $t_1$ (i.e., subjects with $E_2(t_1, t_2) = 1$ are different from those with $E_2(t_1, t_3) = 1$). This lack of transitivity jeopardizes the generalizability of SBC when pairwise comparisons are used to make inference across a population.

Despite this limitation, versions of SBC have been replicated often in the fields of economics, politics, and public health (Bryson, Dorsett and Purdon, 2002; Dorsett, 2006; Levin and Alvarez, 2009; Drichoutis, Lazaridis and Nayga Jr, 2005; Kosteas, 2010), with the original papers having 562 (Lechner, 2001) and 316 (Lechner, 2002) Google Scholar citations, respectively, as of March 2014.
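The pairwise score and common support indicator used by SBC can be sketched as follows. This is a minimal illustration under stated assumptions, not Lechner's implementation: the GPS values are hypothetical and taken as already estimated, the common support is taken to be the overlap of the observed score ranges in the two treatment groups, and the names `pairwise_score` and `pairwise_common_support` are invented for this example.

```python
def pairwise_score(gps, a, b):
    """Pairwise score r(a, X) / (r(a, X) + r(b, X)) from a GPS dict."""
    return gps[a] / (gps[a] + gps[b])


def pairwise_common_support(subjects, a, b):
    """Return {id: 0 or 1} for subjects who received treatment a or b.

    A subject gets 1 when its pairwise score lies inside the overlap of the
    score ranges observed among those on a and among those on b (assumed
    here to define the common support).
    """
    pair = [s for s in subjects if s["treatment"] in (a, b)]
    scores = {s["id"]: pairwise_score(s["gps"], a, b) for s in pair}
    by_group = {
        g: [scores[s["id"]] for s in pair if s["treatment"] == g] for g in (a, b)
    }
    low = max(min(by_group[a]), min(by_group[b]))
    high = min(max(by_group[a]), max(by_group[b]))
    return {i: int(low <= e <= high) for i, e in scores.items()}


# Hypothetical estimated GPS vectors (not from the paper).
subjects = [
    {"id": 1, "treatment": "t1", "gps": {"t1": 0.60, "t2": 0.20, "t3": 0.20}},
    {"id": 2, "treatment": "t1", "gps": {"t1": 0.50, "t2": 0.40, "t3": 0.10}},
    {"id": 3, "treatment": "t1", "gps": {"t1": 0.80, "t2": 0.10, "t3": 0.10}},
    {"id": 4, "treatment": "t2", "gps": {"t1": 0.30, "t2": 0.50, "t3": 0.20}},
    {"id": 5, "treatment": "t2", "gps": {"t1": 0.45, "t2": 0.30, "t3": 0.25}},
    {"id": 6, "treatment": "t2", "gps": {"t1": 0.40, "t2": 0.10, "t3": 0.50}},
    {"id": 7, "treatment": "t3", "gps": {"t1": 0.20, "t2": 0.20, "t3": 0.60}},
]

eligible_12 = pairwise_common_support(subjects, "t1", "t2")
```

Subject 7, who received $t_3$, plays no role in the $t_1$ versus $t_2$ comparison, which illustrates why each SBC effect refers to its own subset of the population.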

Fig 2: Two scenarios of eligible subjects with three treatments (Scenarios 2a and 2b): shaded areas represent subjects included in a matched analysis.

Common referent matching

Rassen et al. (2011) proposed common referent matching (CRM) to create sets with one individual from each treatment type. For $\mathcal{T} = \{t_1, t_2, t_3\}$, the steps of CRM are as follows. First, identify $t_1$ as the reference treatment, where $n_1 = \min\{n_1, n_2, n_3\}$. Among those receiving each pair of treatments, $\{t_1, t_2\}$ or $\{t_1, t_3\}$, use logistic or probit regression to estimate $e_{t_1,t_2}(X)$ and $e_{t_1,t_3}(X)$, respectively. Next, match pairs of subjects receiving $t_1$ or $t_2$ using $\hat{e}_{t_1,t_2}(X)$, and likewise match pairs of subjects with overlapping propensity scores on treatment $t_1$ or $t_3$ using $\hat{e}_{t_1,t_3}(X)$. The authors advocate 1:1 caliper matching to calculate these sets. Lastly, these two cohorts are used to construct 1:1:1 matched triplets using the patients receiving $t_1$ who were matched to both a patient receiving $t_2$ and a patient receiving $t_3$, along with their associated matches. Matched pairs from treatments $t_1$ and $t_3$ are discarded if the subject receiving $t_1$ was not matched with a subject on treatment $t_2$, and likewise pairs of subjects receiving $t_1$ and $t_2$ are discarded if there was no match for the reference subject to someone receiving $t_3$. If all subjects receiving $t_1$ are matched both to someone in treatment group $t_2$ and to someone in $t_3$, then matched pairs using CRM would be equivalent to those produced using SBC.
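The final step of CRM, assembling triplets from the two sets of matched pairs, can be sketched as below. This is an illustration only: the pair dictionaries stand in for the output of the 1:1 caliper matching step, which is not implemented here, and the subject identifiers and the function name `common_referent_triplets` are hypothetical.

```python
def common_referent_triplets(pairs_12, pairs_13):
    """Build 1:1:1 triplets from two sets of 1:1 matched pairs.

    pairs_12 maps each matched t1 subject to its t2 match, and pairs_13 maps
    each matched t1 subject to its t3 match. Reference subjects that appear
    in only one of the two mappings are discarded, along with their pair.
    """
    return [
        (ref, pairs_12[ref], pairs_13[ref])
        for ref in sorted(pairs_12)
        if ref in pairs_13
    ]


# Hypothetical output of two separate 1:1 caliper matching runs.
pairs_12 = {"a1": "b4", "a2": "b1", "a3": "b7"}  # t1 subject -> t2 match
pairs_13 = {"a1": "c2", "a3": "c5", "a4": "c9"}  # t1 subject -> t3 match

triplets = common_referent_triplets(pairs_12, pairs_13)
# a2 (no t3 match) and a4 (no t2 match) are dropped.
```

Nothing in this construction compares the $t_2$ and $t_3$ members of a triplet with each other directly; they are linked only through the shared reference subject.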
Let $E_3$ be the indicator for having two pairwise binary PSs within their respective common supports, such that

$$E_3 = \begin{cases} 1 & \text{if } E_2(t_1, t_2) = 1 \text{ and } E_2(t_1, t_3) = 1 \\ 0 & \text{otherwise} \end{cases}$$

CRM estimates the following treatment effects:

$$PATT_{E_3(t_1|t_1,t_2)} = E[Y(t_1) - Y(t_2) \mid T = t_1, E_3 = 1]$$

$$PATT_{E_3(t_1|t_1,t_3)} = E[Y(t_1) - Y(t_3) \mid T = t_1, E_3 = 1]$$

$$PATT_{E_3(t_1|t_2,t_3)} = E[Y(t_2) - Y(t_3) \mid T = t_1, E_3 = 1]$$

A possible weakness of CRM is that there is no guarantee that subjects from treatment groups $t_2$ and $t_3$, paired to the same subject receiving $t_1$, are equivalent across $R(X)$. Thus, inference using these triplets to compare treatments could be confounded by residual covariate differences between the groups.

Evidence against a binary PS approach

The following hypothetical example illustrates possible issues with implementing standard binary PS tools, as in SBC and CRM, when there are multiple treatments. Let $R(X_i)$, $R(X_j)$, and $R(X_k)$ be different sets of treatment assignment vectors for subjects $i$, $j$, and $k$, with $T_i = t_1$, $T_j = t_2$, and $T_k = t_3$, where

$R(X_i) = (0.33, 0.33, 0.33)$
$R(X_j) = (0.05, 0.05, 0.90)$
$R(X_k) = (0.49, 0.02, 0.49)$

Conditional on receiving either treatment $t_1$ or $t_2$, subjects $i$ and $j$ are equally likely to receive $t_1$:

$$P(T = t_1 \mid X_i, T_i \in \{t_1, t_2\}) = P(T = t_1 \mid X_j, T_j \in \{t_1, t_2\}) = 0.50 \quad (11)$$

Similar results hold for subjects $i$ and $k$, conditional on receiving treatment $t_1$ or $t_3$:

$$P(T = t_1 \mid X_i, T_i \in \{t_1, t_3\}) = P(T = t_1 \mid X_k, T_k \in \{t_1, t_3\}) = 0.50 \quad (12)$$
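The arithmetic behind Equations (11) and (12) can be verified directly from the three GPS vectors in the example. The sketch below uses the pairwise score $r(t_a, X)/(r(t_a, X) + r(t_b, X))$ as the conditional probability; the function names and the choice of a maximum absolute difference to summarize how far apart two GPS vectors are are assumptions made for illustration.

```python
# GPS vectors from the hypothetical example in the text.
R = {
    "i": {"t1": 0.33, "t2": 0.33, "t3": 0.33},
    "j": {"t1": 0.05, "t2": 0.05, "t3": 0.90},
    "k": {"t1": 0.49, "t2": 0.02, "t3": 0.49},
}


def pairwise_score(gps, a, b):
    """P(T = a | X, T in {a, b}) computed as r(a, X) / (r(a, X) + r(b, X))."""
    return gps[a] / (gps[a] + gps[b])


def max_abs_difference(gps_u, gps_v):
    """Largest componentwise gap between two GPS vectors (illustrative summary)."""
    return max(abs(gps_u[t] - gps_v[t]) for t in gps_u)


# Equation (11): subjects i and j look identical on the t1-vs-t2 score.
score_i_12 = pairwise_score(R["i"], "t1", "t2")
score_j_12 = pairwise_score(R["j"], "t1", "t2")

# Equation (12): subjects i and k look identical on the t1-vs-t3 score.
score_i_13 = pairwise_score(R["i"], "t1", "t3")
score_k_13 = pairwise_score(R["k"], "t1", "t3")

# Yet the full GPS vectors are far apart.
gap_ij = max_abs_difference(R["i"], R["j"])
gap_ik = max_abs_difference(R["i"], R["k"])
```

The pairwise scores agree exactly while the full vectors differ by more than 0.3 in at least one component, which is the sense in which a binary PS can hide imbalance across $R(X)$.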

As a result, CRM could form a triplet using subjects $i$, $j$, and $k$, and SBC could match subjects $i$ with $j$ and $i$ with $k$ for pairwise comparisons. However, these individuals should not be grouped together because $R(X_i) \neq R(X_j) \neq R(X_k)$, and as a result, the estimation of their unit-level causal effects would be biased and inappropriate (Rubin, 1973b). While such differences could average out when aggregating across several matched sets, it would only take a few poorly matched triplets to yield biased causal effects in small samples. Moreover, for even moderately non-linear relationships between $Y$ and $X$, closely matched individuals, in addition to closely matched groups, are needed in order to estimate unbiased causal effects (Rubin, 1973b).

Inverse probability weighting

A common approach to estimate the causal effects from multiple treatments uses the inverse probability of treatment assignment as weights (IPW) (Imbens, 2000; Feng et al., 2011; McCaffrey et al., 2013). In this setting, only the assumption of weak unconfoundedness is needed (Imbens, 2000). Let $I(t) = 1$ if $T = t$, and $0$ otherwise. Weak unconfoundedness states that $\forall t \in \mathcal{T}$, $P(I(t) = 1 \mid Y(t), X) = P(I(t) = 1 \mid X)$, and allows the estimation of average outcomes within subpopulations defined by pretreatment status (Imbens, 2000). While weak unconfoundedness is a relaxed form of strong unconfoundedness, Imbens (2000) acknowledges that the contrast between the two "in substantive terms...is not very different" (p. 707). Feng et al.
(2011) implemented IPW to estimate PATEs between each pair of treatments, such that to contrast $t_1$ and $t_2$,

$$\widehat{PATE}_{t_1,t_2} = \hat{E}[Y(t_1)] - \hat{E}[Y(t_2)] \quad (13)$$

where

$$\hat{E}[Y(t_1)] = \left(\sum_{i=1}^{N} \frac{I(T_i = t_1)\, Y_i}{r(t_1, X_i)}\right) \left(\sum_{i=1}^{N} \frac{I(T_i = t_1)}{r(t_1, X_i)}\right)^{-1} \quad \text{and} \quad \hat{E}[Y(t_2)] = \left(\sum_{i=1}^{N} \frac{I(T_i = t_2)\, Y_i}{r(t_2, X_i)}\right) \left(\sum_{i=1}^{N} \frac{I(T_i = t_2)}{r(t_2, X_i)}\right)^{-1}$$

To estimate weights, the ordinal logistic model (Dearing, McCartney and Taylor, 2009; Crosnoe, Augustine and Huston, 2012), the multinomial probit model (Penson et al., 2003), the multinomial logistic model (Toh, García Rodríguez and Hernán, 2012), and generalized boosted models (McCaffrey et al., 2013) have all been used.

A plausible issue with IPW is that extreme weights could yield erratic causal estimates with large sample variances (Little, 1988; Kang and Schafer, 2007; Stuart and Rubin, 2008), an issue which is increasingly likely for $Z > 2$, where treatment assignment probabilities generally decrease. As in the binary treatment setting, it is possible to trim or eliminate subjects with extreme weights. However, Kilpatrick et al. (2012) found that inclusion of high weights was necessary to maintain covariate balance in the multiple treatment setting, and that removal of extreme weights could result in bias in an unknown direction.

Matching for multiple treatments

More recently, attempts have been made to more formally match on the GPS, with the goal of grouping several subjects together with similar $R(X)$s, including at least one subject receiving each treatment. With $Z = 3$, Rassen et al. (2013) proposed within-trio matching (WithinTrio) to form triplets of subjects. The goal of WithinTrio is to optimize triplet similarities based on a subject's GPSs for treatments $t_1$ and $t_2$, done using a distance function between all possible pairs of triplets via the KD-tree algorithm (Hott, Brunelle and Myers, 2012). In simulations, Rassen et al.
(2013) found that triplets produced using WithinTrio yielded, depending on the setting, higher, equivalent, or lower standardized covariate bias when compared to CRM and SBC. As currently constructed, WithinTrio generates $n_1$ matched sets in random order, for $n_1 = \min\{n_1, n_2, n_3\}$. As a result, because all subjects receiving $t_1$ are matched, there is the potential to form dissimilar triplets if, for example, all close matches to a subject who received $t_1$ were already taken. Moreover, PATTs generalizable to those receiving treatment $t_2$ or $t_3$ cannot be estimated using WithinTrio. Finally, WithinTrio cannot currently handle scenarios involving more than three treatment types, and widespread public software for this algorithm's implementation remains unavailable.

Lastly, to estimate PATEs from multiple treatments, Tu, Jiao and Koh (2012) used different clustering algorithms to bin subjects into subclasses based on their $R(X)$s. The authors found that K-means clustering (KMC) on the logit transformation of the GPS vector, where

$$\mathrm{logit}(R(X)) = \left(\log\frac{r(t_1, X)}{1 - r(t_1, X)}, \ldots, \log\frac{r(t_Z, X)}{1 - r(t_Z, X)}\right),$$

generally provided the highest within-subclass covariate similarity between those receiving different treatments. Although the authors do not provide guidelines regarding which subjects should be included in generating the clusters (e.g., a common support), if all subjects were subclassified, causal effects could be estimated within each subclass and then aggregated across subclasses using a weighted average to estimate either PATEs or PATTs. There are no known implementations of KMC on real data to estimate causal effects for a nominal exposure.
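Returning to the weighted estimator in (13), its arithmetic can be illustrated with a minimal Python sketch. The function names `ipw_mean` and `pate`, the integer coding of treatments, and the array layout (one GPS column per treatment) are assumptions made for illustration and are not taken from any of the cited implementations.

```python
import numpy as np

def ipw_mean(y, t, gps, level):
    """Normalized IPW estimate of E[Y(level)], following the form of (13).

    y     : (N,) observed outcomes
    t     : (N,) integer treatment labels 0..Z-1 (assumed coding)
    gps   : (N, Z) estimated GPS; gps[i, z] approximates r(t_z, X_i)
    level : treatment whose mean potential outcome is estimated
    """
    received = (t == level).astype(float)   # I(T_i = level)
    weights = received / gps[:, level]      # I(T_i = level) / r(level, X_i)
    return np.sum(weights * y) / np.sum(weights)

def pate(y, t, gps, a, b):
    """Pairwise contrast E[Y(a)] - E[Y(b)] built from the weighted means."""
    return ipw_mean(y, t, gps, a) - ipw_mean(y, t, gps, b)
```

Because each weight is the reciprocal of a probability, a subject with a very small estimated GPS for the treatment received dominates both sums, which is the mechanism behind the erratic estimates discussed above.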

2. Matching on a vector of generalized propensity scores

Problems associated with binary PS methods, the limited availability and adaptability of current multivariate strategies, and the potential for extreme weights to result in erratic causal estimates motivate the need for an improved, easy-to-implement algorithm which can match subjects with similar GPS vectors. We propose vector matching (VM) to form matched sets for inference under multiple, nominal treatments to estimate causal effects among subjects eligible for all treatments.

Eligibility criterion

Whereas SBC and CRM methods may be of interest for specific research questions, in general, their estimated effects generalize only to specific subsets of a population, making them insufficient for clinicians and policy makers, who generally are looking to compare three or more active treatments at once (Rassen et al., 2011; Hott, Brunelle and Myers, 2012). Those sufficiently eligible for all treatments simultaneously, however, are more representative of the clinical trial we are hoping to replicate, in which each subject has a realistic chance of receiving all treatments. Rassen et al. (2013) refer to this balance with respect to measured covariates and trial eligibility as clinical equipoise. In this sense, both a decision to use only subjects eligible for all treatments and this definition of eligibility represent important distinctions from pairwise algorithms. We expand the common support for a binary treatment (Dehejia and Wahba, 1998) to one for multiple treatments as follows. First, estimate $R(X)$ using a multinomial model.
For each treatment $t$, $t \in \mathcal{T}$, define $r(t, X)^{(low)}$ and $r(t, X)^{(high)}$ as follows:

$$r(t, X)^{(low)} = \max\big(\min(r(t, X \mid T = t_1)), \min(r(t, X \mid T = t_2)), \ldots, \min(r(t, X \mid T = t_Z))\big) \qquad (14)$$

$$r(t, X)^{(high)} = \min\big(\max(r(t, X \mid T = t_1)), \max(r(t, X \mid T = t_2)), \ldots, \max(r(t, X \mid T = t_Z))\big) \qquad (15)$$

Because subjects with $r(t, X) \notin \big(r(t, X)^{(low)}, r(t, X)^{(high)}\big)$ may not be equivalent to the remaining study population on $X$, these subjects would be discarded from VM. This exclusion criterion is applied $\forall t \in \mathcal{T}$, after which it is recommended to re-fit the GPS model, to ensure estimated GPSs are not disproportionately impacted by those dropped (suggested for binary treatment in Imbens and Rubin (2013)). Let $E_4$ be the indicator for all-treatment eligibility, where

$$E_4 = \begin{cases} 1 & \text{if } r(t, X) \in \big(r(t, X)^{(low)}, r(t, X)^{(high)}\big) \ \forall t \in \mathcal{T} \\ 0 & \text{otherwise.} \end{cases}$$

Our goal is to estimate PATTs among subjects eligible for all treatments. For example, with reference group $t_1$, we have

$$PATT_{E_4}(t_1 \mid t_1, t_2) = E[Y(t_1) - Y(t_2) \mid T = t_1, E_4 = 1] \qquad (16)$$

$$PATT_{E_4}(t_1 \mid t_1, t_3) = E[Y(t_1) - Y(t_3) \mid T = t_1, E_4 = 1] \qquad (17)$$

$$\vdots$$

$$PATT_{E_4}(t_1 \mid t_1, t_Z) = E[Y(t_1) - Y(t_Z) \mid T = t_1, E_4 = 1] \qquad (18)$$

Because estimands (16)-(18) are transitive, $PATT_{E_4}(t_1 \mid t_1, t_2)$ and $PATT_{E_4}(t_1 \mid t_1, t_3)$, for example, could be contrasted to compare $t_2$ and $t_3$ in the population of subjects who received $t_1$. An additional benefit of the eligibility criterion is a lower need for extrapolation with respect to ineligible subjects. The shaded region in Figure 2 (Scenario 2b) depicts the subset of those eligible for all three treatments.

Vector Matching

The primary downside of applying binary PS tools to match for nominal treatments is that there is no guarantee that subjects paired together are similar across their GPS vectors. If we were able to ensure that subjects could only be matched on one GPS if they were approximately equivalent to one another on the remaining GPSs, binary matching tools could prove useful with multiple treatments.
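The eligibility rule in (14) and (15) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `common_support`, the integer coding of treatments, and the requirement that every treatment is observed at least once are hypothetical choices, and the strict inequalities follow the open-interval notation used for $E_4$.

```python
import numpy as np

def common_support(gps, t):
    """Bounds (14)-(15) and the all-treatment eligibility indicator.

    gps : (N, Z) estimated GPS, column z approximating r(t_z, X)
    t   : (N,) integer treatment labels 0..Z-1 (assumed coding)
    Returns (low, high, eligible); low and high have one entry per treatment.
    """
    n_treat = gps.shape[1]
    groups = [gps[t == g] for g in range(n_treat)]
    # Per GPS component: the largest group-wise minimum and the
    # smallest group-wise maximum across the Z treatment groups.
    low = np.max([grp.min(axis=0) for grp in groups], axis=0)
    high = np.min([grp.max(axis=0) for grp in groups], axis=0)
    # A subject is eligible only if every component lies strictly inside.
    eligible = np.all((gps > low) & (gps < high), axis=1)
    return low, high, eligible
```

After subjects with `eligible == False` are dropped, the text recommends re-fitting the GPS model on the remaining subjects before matching.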
One common approach for generating groups of like subjects is to place them into subclasses, where subjects in the same subclass are homogeneous with respect to background covariates. Subclassification on a binary PS, for example, can remove upwards of 90% of the bias caused by heterogeneity in the pre-matched cohort under strong unconfoundedness (Rosenbaum and Rubin, 1984). With multiple treatments and GPSs, we propose subclassifying subjects using K-means clustering, and matching subjects only if they appear in the same subclass. With three treatments, for example, if subjects matched on $r(t_1, X)$ were also similar on $r(t_3, X)$, their GPS vectors would roughly be the same. This strategy can be implemented by exact matching within each subclass (available in, among other options, the Matching package in R statistical software (Sekhon, 2008)). By exact matching, VM ensures that subjects paired to the reference treatment are both nearly perfect matches on one GPS and roughly similar on

other GPS scores. In this sense, with a goal of obtaining perfect matches within subsets of other important scores, exact matching is similar to blocking in a randomized experiment, which has been shown to lead to large reductions in bias for binary treatment (Stuart and Rubin, 2008).

In our proposed algorithm, we also match with replacement. Compared to matching on a scalar without replacement, matching with replacement has been shown to yield lower bias in matched sets (Abadie and Imbens, 2006). Moreover, because individuals not receiving the reference treatment can be matched more than once, matching with replacement allows estimation of PATTs generalizable to each treatment group, and not just the group with the smallest sample size. With any $t \in \mathcal{T} = \{t_1, \ldots, t_Z\}$, vector matching (VM) estimates (16) to (18), using $t$ as a reference, as follows:

1. Model $T \mid X$ using a multinomial logistic model. After dropping those outside the common support (e.g., those with $E_4 = 0$), re-fit the model.
2. Extract $R(X_i)$ for each eligible subject $i$, $i = 1, \ldots, N_{E_4=1}$, where $N_{E_4=1}$ is the overall number of subjects with $E_4 = 1$.
3. For each $t' \neq t$:
   (a) Cluster subjects using K-means clustering (KMC) on the logit transform of the estimated GPS scores for all treatments other than $t$ and $t'$. This forms $K$ strata of subjects, with roughly equivalent values of the $Z - 2$ remaining GPS scores within each stratum $k \in K$. For example, with $Z = 5$, $\mathcal{T} = \{t_1, \ldots, t_5\}$, reference treatment $t_1$, and letting $t' = t_2$, VM would use KMC on $\mathrm{logit}(r(t_3, X_i), r(t_4, X_i), r(t_5, X_i))$.
   (b) Match those receiving $t$ to those receiving $t'$ on $\mathrm{logit}(r(t, X_i))$ within each $k$. We match with replacement using a caliper of [...]. In our example, this matches subjects receiving $t_1$ to those receiving $t_2$ within each of the strata produced by KMC.
4. Extract the matched sets: those subjects receiving $t$ who were matched to a subject receiving each of the other treatments, along with their matches receiving the other treatments.
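The steps above can be sketched for $Z = 3$ in Python using only NumPy. This is an illustrative sketch, not the authors' implementation: the function names, the small Lloyd-style K-means routine (standing in for any KMC implementation), the choice to cluster all eligible subjects, and the `caliper` argument (expressed on the logit scale, with no default because the value used in the text is not reproduced here) are all assumptions. The GPS matrix is assumed to have been estimated and restricted to eligible subjects beforehand.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def kmeans_labels(x, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm on x with shape (N, D); returns cluster labels."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # an emptied cluster keeps its old center
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def nearest_in_stratum(score, t, strata, ref, other, caliper):
    """Match each subject with t == ref to its nearest subject with
    t == other in the same stratum, with replacement; -1 if no candidate
    lies within the caliper."""
    match = np.full(len(t), -1)
    for i in np.flatnonzero(t == ref):
        pool = np.flatnonzero((t == other) & (strata == strata[i]))
        if pool.size == 0:
            continue
        gap = np.abs(score[pool] - score[i])
        if gap.min() <= caliper:
            match[i] = pool[gap.argmin()]
    return match

def vector_match_3(gps, t, caliper, k=5, ref=0):
    """Vector matching sketch for three treatments coded 0, 1, 2."""
    lg = logit(np.asarray(gps, dtype=float))
    t = np.asarray(t)
    score = lg[:, ref]                       # matching variable: logit r(ref, X)
    others = [z for z in range(3) if z != ref]
    matched = {}
    for other in others:
        remaining = next(z for z in others if z != other)
        # Strata are built on the GPS component not used for matching.
        strata = kmeans_labels(lg[:, [remaining]], k)
        matched[other] = nearest_in_stratum(score, t, strata, ref, other, caliper)
    a, b = others
    # Keep reference subjects matched to both other treatments.
    return [(int(i), int(matched[a][i]), int(matched[b][i]))
            for i in np.flatnonzero(t == ref)
            if matched[a][i] >= 0 and matched[b][i] >= 0]
```

Each returned tuple holds the row indices of one triplet (reference subject, its match for the second treatment, its match for the third). Because matching is with replacement, the same non-reference subject may appear in several triplets.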
Up to $n_{t_1, E_4=1}$ triplets are possible using VM, where $n_{t_1, E_4=1}$ is the number of subjects receiving $t_1$ with $E_4 = 1$. VM is particularly straightforward for $Z = 3$:

1. Match those receiving $t_1$ to those receiving $t_2$ on $\mathrm{logit}(r(t_1, X_i))$ within K-means strata of $\mathrm{logit}(r(t_3, X_i))$.
2. Match those receiving $t_1$ to those receiving $t_3$ on $\mathrm{logit}(r(t_1, X_i))$ within K-means strata of $\mathrm{logit}(r(t_2, X_i))$.
3. Extract the subjects receiving $t_1$ who were matched to both subjects receiving $t_2$ and $t_3$.

3. Simulations

The goal of matching is to replicate the covariate balance which would have occurred in a completely randomized design, and success should be measured by the similarity among those matched (Stuart and Rubin, 2008). While extensive diagnostics are generally recommended in real data analyses, with binary treatment, a common standard is to check for differences in the covariates between each treatment group among the matched set. Only after the groups are judged to be balanced on the distributions of covariates can the matched cohort be used to estimate causal effects using outcome $Y$ (Rubin, 2001). The benefits of well matched sets are plentiful; regression adjustment, for example, can remove nearly 100% of the bias in causal effect estimation when used on well matched sets (Cochran and Rubin, 1973). However, when the distributions of $X$ are asymmetric, regression adjustment on matched sets may yield erratic estimates (Cochran and Rubin, 1973). Moreover, when differences in the mean covariates between those at different treatments are a quarter of a standard deviation apart and the true outcome model is moderately non-linear, standard regression adjustment can lead to both increased bias (from 100% to 298%) and decreased bias (from 100% to -304%), the latter of which moves the bias in the other direction (see Table 1, Rubin (2001)).
As a result, our primary focus lies in the balance between those matched, as measured by similarity in $X$, across a variety of covariate settings, and we purposely exclude $Y$. Matching procedures VM, CRM, and IPW are contrasted in the sections that follow; SBC is not included because (i) as a binary matching tool, it will generate similar matches to CRM, if not the identical ones, and (ii) because SBC cannot generate matched vectors, it cannot be used to contrast three or more treatments simultaneously, as in (16) to (18).

Evaluating balance of matched sets by simulation

We begin by generating $X$ for different sets of $N$ subjects receiving one of three treatments, $T \in \{t_1, t_2, t_3\}$, with $n_1$, $n_2 = \gamma n_1$, and $n_3 = \gamma^2 n_1$ the sample sizes of subjects receiving treatments $t_1$, $t_2$, and $t_3$, respectively, where $X = [X_1, \ldots, X_P]$ is distributed symmetrically such that:

$$T_i = t_1, \quad i = 1, \ldots, n_1 \qquad (19)$$
$$T_i = t_2, \quad i = n_1 + 1, \ldots, n_1 + \gamma n_1 \qquad (20)$$
$$T_i = t_3, \quad i = n_1 + \gamma n_1 + 1, \ldots, n_1 + \gamma n_1 + \gamma^2 n_1 \qquad (21)$$
$$X_i \mid \{T_i = t_1\} \sim f(\mu_1, \Sigma_1), \quad i = 1, \ldots, n_1 \qquad (22)$$
$$X_i \mid \{T_i = t_2\} \sim f(\mu_2, \Sigma_2), \quad i = n_1 + 1, \ldots, n_1 + \gamma n_1 \qquad (23)$$
$$X_i \mid \{T_i = t_3\} \sim f(\mu_3, \Sigma_3), \quad i = n_1 + \gamma n_1 + 1, \ldots, n_1 + \gamma n_1 + \gamma^2 n_1 \qquad (24)$$
$$\mu_1 = ((b, 0, 0), \ldots, (b, 0, 0))^T \qquad (25)$$
$$\mu_2 = ((0, b, 0), \ldots, (0, b, 0))^T \qquad (26)$$
$$\mu_3 = ((0, 0, b), \ldots, (0, 0, b))^T \qquad (27)$$
$$\Sigma_1 = \begin{pmatrix} 1 & \tau & \cdots & \tau \\ \tau & 1 & \cdots & \tau \\ \vdots & \vdots & \ddots & \vdots \\ \tau & \tau & \cdots & 1 \end{pmatrix} \qquad (28)$$
$$\Sigma_2 = \begin{pmatrix} \sigma_2^2 & \tau & \cdots & \tau \\ \tau & \sigma_2^2 & \cdots & \tau \\ \vdots & \vdots & \ddots & \vdots \\ \tau & \tau & \cdots & \sigma_2^2 \end{pmatrix} \qquad (29)$$
$$\Sigma_3 = \begin{pmatrix} \sigma_3^2 & \tau & \cdots & \tau \\ \tau & \sigma_3^2 & \cdots & \tau \\ \vdots & \vdots & \ddots & \vdots \\ \tau & \tau & \cdots & \sigma_3^2 \end{pmatrix} \qquad (30)$$

Table 1. Simulation factors

Factor                                              Levels of factor
$n_1$                                               {500, 2000}
$\gamma = n_2/n_1 = n_3/n_2$                        {1, 2}
$f$                                                 {$t_7$, Normal}
$B = b / \sqrt{(1 + \sigma_2^2 + \sigma_3^2)/3}$    {0, 0.25, 0.50, 0.75, 1.00}
$\tau$                                              {0, 0.25}
$\sigma_2^2$                                        {0.5, 1, 2}
$\sigma_3^2$                                        {0.5, 1, 2}
$P$                                                 {3, 6}

The means in (25) to (27) were chosen to mimic simulations in bivariate treatment settings, where for increasing $b$, subjects are increasingly likely to receive their respective treatments (as in Rubin (1979) and Cangul et al. (2009), among others). The distribution of $X$ is varied by eight factors (Table 1). The first two factors are design features which an investigator has some control over, while the last six, for the most part, are unknown data attributes. The distance between treated and control group means is stated in terms of standardized bias $B$, where

$$B = \frac{b}{\sqrt{(1 + \sigma_2^2 + \sigma_3^2)/3}} \qquad (31)$$

in order to evaluate the effect of bias on match quality independent of the variance ratios $\sigma_2^2$ and $\sigma_3^2$. Due to the small number of eligible subjects remaining when $P = 6$ and $n_1 = 500$, these simulations are discarded, leaving 1080 simulation conditions. For each simulation condition, 200 data sets are generated, and on each data set, VM (using $K = 5$ strata), CRM (matching on binary PS with caliper 0.05), and IPW (weights generated using multinomial logistic regression) are used to form matched sets.
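A single draw from the design in (19) to (31) can be sketched as follows. The function names and the seed handling are hypothetical. For the $t_7$ case, the sketch assumes that $\Sigma$ acts as the scale matrix of a multivariate $t$ built as a normal scale mixture; the text does not state whether the covariance is rescaled, so that choice is an assumption.

```python
import numpy as np

def b_from_B(B, var2, var3):
    """Invert (31): raw mean shift b for a given standardized bias B."""
    return B * np.sqrt((1.0 + var2 + var3) / 3.0)

def simulate_covariates(n1, gamma, b, tau, var2, var3, P, dist="normal", seed=0):
    """Draw (X, T) for three treatment groups of sizes n1, gamma*n1, gamma^2*n1."""
    rng = np.random.default_rng(seed)
    sizes = [n1, int(round(gamma * n1)), int(round(gamma ** 2 * n1))]
    variances = [1.0, var2, var3]            # diagonals of Sigma_1, Sigma_2, Sigma_3
    X_parts, T_parts = [], []
    for g, (n, v) in enumerate(zip(sizes, variances)):
        mu = np.zeros(P)
        mu[g::3] = b                         # (b,0,0,...), (0,b,0,...), (0,0,b,...)
        sigma = np.full((P, P), float(tau))  # common off-diagonal tau
        np.fill_diagonal(sigma, v)
        z = rng.multivariate_normal(np.zeros(P), sigma, size=n)
        if dist == "t7":
            # Multivariate t with 7 df as a normal scale mixture (assumed form).
            scale = np.sqrt(rng.chisquare(7, size=n) / 7.0)
            z = z / scale[:, None]
        X_parts.append(mu + z)
        T_parts.append(np.full(n, g))
    return np.vstack(X_parts), np.concatenate(T_parts)
```

For example, `simulate_covariates(2000, 2, b_from_B(0.5, 1.0, 2.0), 0.25, 1.0, 2.0, 6)` would correspond to one cell of Table 1 with $n_1 = 2000$, $\gamma = 2$, $B = 0.5$, $\tau = 0.25$, $\sigma_2^2 = 1$, $\sigma_3^2 = 2$, and $P = 6$ under normal covariates.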

Simulation metrics

While several metrics have been proposed for evaluating the success of matching with binary treatments (see Austin, Grootendorst and Anderson (2007); Austin (2009), for example), assessments for multiple treatments are not as well formalized (Stuart, 2010). For VM or CRM, let $w_i$ be the number of times subject $i$ is part of a triplet, and for IPW, let $w_i$ be each subject's weight, as estimated using the inverse GPS at the treatment they received (see Section 1.6.4). Let $\bar{X}_{pt}$ be the weighted mean of covariate $p$, $p = 1, \ldots, P$, at treatment $t$, such that

$$\bar{X}_{pt} = \frac{\sum_{i=1}^{N} X_{pi}\, I(T_i = t)\, w_i}{n_{trip}} \qquad (32)$$

For each $p$, our interest lies in the similarity of $\bar{X}_{p1}$, $\bar{X}_{p2}$, and $\bar{X}_{p3}$. With binary treatment, Rubin and Thomas (1996) and Rubin (2001) suggest that $SB_{p12}$, the standardized bias between $X_p$ in the treatment ($t_1$) and control ($t_2$) groups, should be less than 0.25, where

$$SB_{p12} = \frac{\bar{X}_{p1} - \bar{X}_{p2}}{\delta_{p1}} \qquad (33)$$

and $\delta_{p1}$ is the standard deviation of $X_p$ in the treatment group. With multiple treatments, McCaffrey et al. (2013) uses a bias cutoff of [...]. There are $\frac{Z!}{2(Z-2)!}$ pairwise standardized biases to calculate in a multiple treatment setting, which can be computationally intensive for large $Z$. For $Z = 3$, however, as in our simulations, we only have three such biases for each $p$: $SB_{p12}$, $SB_{p13}$, and $SB_{p23}$. As in Hade (2012), we extract the maximum absolute standardized pairwise bias at each covariate, $Max2SB_p$, such that

$$Max2SB_p = \max(|SB_{p12}|, |SB_{p13}|, |SB_{p23}|) \qquad (34)$$

With three treatment pairs, the maximum absolute biases reflect a worst-case scenario, representing the largest discrepancy in estimated covariate means between any two treatment groups.
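The metric in (32) to (34) can be sketched as below. The function names and the integer coding of treatments are illustrative assumptions. The sketch normalizes each weighted mean by the sum of the weights in that treatment group; for matched triplets this sum equals the number of triplets, while for IPW it amounts to using normalized weights, which is an assumption. The reference standard deviation `ref_sd` is supplied by the caller, for instance from the subjects receiving $t_1$.

```python
import numpy as np
from itertools import combinations

def weighted_means(X, t, w, n_treat):
    """Weighted covariate means per treatment group, shape (n_treat, P)."""
    return np.array([np.average(X[t == g], axis=0, weights=w[t == g])
                     for g in range(n_treat)])

def max2sb(X, t, w, ref_sd, n_treat=3):
    """Maximum absolute pairwise standardized bias for each covariate.

    X      : (N, P) covariates of weighted or matched subjects
    t      : (N,) integer treatment labels 0..n_treat-1 (assumed coding)
    w      : (N,) match counts (VM, CRM) or inverse-GPS weights (IPW)
    ref_sd : (P,) standard deviations used for standardization
    """
    means = weighted_means(X, t, w, n_treat)
    # One standardized bias per unordered pair: Z!/(2(Z-2)!) pairs in total.
    sb = np.array([(means[a] - means[b]) / ref_sd
                   for a, b in combinations(range(n_treat), 2)])
    return np.abs(sb).max(axis=0)
```

Averaging the returned vector over its $P$ entries gives a single summary of balance per data set.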
For all matching algorithms and at each $p$, $\delta_{p1}$, the standard deviation of $X_p$ in the full sample among those receiving reference $t_1$, is used for standardization, to ensure that observed differences in the similarity of those matched are easily contrasted (as in Stuart and Rubin (2008)). Also extracted is %Matched, the fraction of subjects who received $t_1$ and were eligible to receive the other two treatments who were included in the final matched set. Inclusion of this metric can provide us a sense of both how many triplets VM and CRM can form relative to those eligible for all treatments and, moreover, how similar those matched are to the population we are interested in studying. Simulations with %Matched $\approx 1$ and relatively low $Max2SB_p\ \forall p$, for example, are the optimal result, where all eligible subjects who received $t_1$ would have been matched, with the matched subjects receiving $t_2$ and $t_3$ roughly identical to those receiving $t_1$ with respect to each pairwise GPS comparison. Under these results, the sets would be well-balanced, and inference would reflect the population of those receiving $t_1$.

There are 200 independent covariate draws from each of the 1080 simulation settings. At each draw and for each of the matching algorithms, VM, IPW, and CRM, $Max2SB_p\ \forall p$ and %Matched are obtained, with these results averaged across simulation sets at each factor (%Matched is not relevant for IPW). We simplify $Max2SB_p\ \forall p$ by averaging over $p$, such that $Max2SB = \sum_{p=1}^{P} Max2SB_p / P$.

Determinants of matching performance: ANOVA

ANOVA is performed on $Max2SB$ and %Matched using the set of results from CRM, IPW, and VM, across all simulation factors in Table 1, as well as all of their pairwise and three-way interactions. The purposes of ANOVA are to see which factors are relevant to VM, CRM, and IPW performance, and to identify the success of each method across all conditions.
High values of $Max2SB$ will suggest poorly matched triplets, while low %Matched will imply that the matched sets may not generalize to the eligible subject cohort. Initial covariate bias $B$ drives the majority of variation in the bias remaining after matching, accounting for roughly 85%, 70%, and 45% of the variability in $Max2SB$ for CRM, VM, and IPW, respectively (Table 2). Bias also drove nearly 100% of the variability in %Matched for VM and CRM (not shown). Compared to VM and CRM, IPW biases are substantially driven by the degrees of freedom (df) and the variance terms $\sigma_2$ and $\sigma_3$; skewed and highly variable covariates, which are more likely to yield extreme weights, appear to be driving covariate dissimilarity after weighting using IPW.

Table 2. ANOVA for VM, CRM and IPW Max2SB: most influential factors

V M IP W CRM Variable MSE Variable MSE Variable MSE B B B p 3092 (df) γ 4089 γ 1063 B:(df) σ t 2271 σ t 792 σ s σ s 469 σ s 680 p B:σ t 243 τ 517 n B:γ 233 B:p 490 σ t B:σ s 147 B:σ t 182 B:σ s 5280 p 64 B:σ s 158 B:p 3838 B:n 1 48 σ s:p 142 B:n (df) 28

Fig 3: Max2SB for pre-matched cohort and by matching algorithms (left), and %Matched for VM and CRM

arxiv: v2 [stat.me] 19 Jan 2017

Estimation of causal effects with multiple treatments: a review and new ideas Michael J. Lopez, and Roee Gutman Department of Mathematics, 815 N. Broadway, Saratoga Springs, NY 12866 e-mail: mlopez1@skidmore.edu.