arxiv: v1 [stat.me] 18 Nov 2018

Size: px

Start display at page:

Download "arxiv: v1 [stat.me] 18 Nov 2018"

Todd Bond
5 years ago
Views:

1 MALTS: Matching After Learning to Stretch Harsh Parikh 1, Cynthia Rudin 1, and Alexander Volfovsky 2 1 Computer Science, Duke University 2 Statistical Science, Duke University arxiv: v1 [stat.me] 18 Nov 2018 November 20, 2018 Abstract We introduce a flexible framework for matching in causal inference that produces high quality almostexact matches. Most prior work in matching uses ad hoc distance metrics, often leading to poor quality matches, particularly when there are irrelevant covariates that degrade the distance metric. In this work, we learn an interpretable distance metric used for matching, which leads to substantially higher quality matches. The distance metric can stretch continuous covariates and matches exactly on categorical covariates. The framework is flexible in that the user can choose the form of distance metric, the type of optimization algorithm, and the type of relaxation for matching. Our ability to learn flexible distance metrics leads to matches that are interpretable and useful for estimation of conditional average treatment effects. 1 Introduction Matching methods are pervasively used throughout the social and health sciences to make causal conclusions where access to randomized control trial is scarce, but when observational data are widely available. Matching methods find groups of similar individuals, some of whom select into treatment and some of whom select into control. As such, matching methods are interpretable in that they allow fine-grained troubleshooting of the data; for instance, examining the matched groups may allow the user to locate unmeasured confounders that lead units to have a higher chance of being treated. Matching is also powerful in that it can allow the user to correctly estimate highly nonlinear treatment effects but this is only true when the matches are of high quality. The quality of the matches is our main consideration in this work. Typically, matching methods place units into the same matched group when they are close together, where closeness is measured in terms of a pre-defined distance (e.g., exact, coarsened exact, Euclidean, etc.), while maintaining balance constraints between treatment and control units. Despite its merits, this classical paradigm has serious flaws, namely that it relies heavily on a properly specified distance metric. The distance metric cannot be determined without an understanding of the importance of the variables; for instance, the quality of any prespecified distance metric that weighs all covariates equally will degrade as the number of irrelevant covariates increases. This is true no matter which matching methodology is used as the distance metric calculations could be essentially determined by these irrelevant covariates. This has previously been referred to as the toenail problem (Wang et al., 2017), where the weighing of irrelevant covariates (like toenail length) can overwhelm the metric for matching. A separate concern is that the covariates may be scaled differently, where a given distance along one covariate has a different impact than the same distance along a different covariate; in this case, the total distance metric can easily be determined by less relevant covariates, again leading to lower quality matched groups. Ideally, the distance metric would stretch to capture important covariates so that after matching, estimates of treatment effects within the matched groups would be accurate. If the researcher knows how to 1

2 choose the distance metric so that it leads to accurate treatment effect estimates, this would solve the problem. However there is no reason to believe that this is achievable in high-dimensional settings. Producing high dimensional functions to characterize data is a task for which humans alone are not naturally adept. In this work, we propose a framework for matching where an interpretable distance measure between matched units is learned from a held-out training set. As long as the distance metric generalizes from the training set to the full sample, we are able to compute high-quality matches and accurate estimates of treatment effects (conditional average treatment effects CATEs) within the matched groups. One can use any form of distance metric to train, and in this work we focus on exact matching for discrete variables and generalized Mahalanobis distances for continuous variables. By definition, the generalized Mahalanobis distance is determined by a matrix. If the matrix is diagonal, the distance calculation is a stretch for each covariate. Irrelevant covariates will be compressed so that their values are effectively always zero. Highly relevant covariates will be stretched, so that to be considered a match, the values of that covariate must be very similar. In this way, diagonal matrices lead to very interpretable distance metrics. If the Mahalanobis distance matrix is not constrained to be diagonal, then it induces a stretch and rotation, leading to more flexible but less interpretable notions of distance. The new framework is called Learning-to-Match, and the algorithm introduced in this work is called Matching After Learning to Stretch (MALTS). We tested MALTS against several other matching methods in simulation studies, where ground truth is known, and here, MALTS achieves substantially and consistently better results than other methods including Genmatch, propensity score matching, standard Mahalanobis distance and coarsened exact matching. Even though our method is heavily constrained to produce interpretable matches, it performs at the same level as non-matching methods that are designed to fit extremely flexible but uninterpretable models directly to the response surface. 2 Related work Since the 1970 s, the causal inference literature on matching methods has concentrated on dimension reduction techniques (e.g., Rubin, 1973a,b, 1976; Cochran and Rubin, 1973). The leading dimension reduction approach in this literature uses the propensity score, the conditional probability of treatment given covariate information. Propensity score methods target average treatment effects and so do not produce exact matches or almost-exact matches. When treatment is binary, they project data onto one dimension, and closeness of units in propensity score does not imply their closeness in covariate space. As a result, the matches cannot directly be used for estimating heterogeneous treatment effects. Regression methods can be used for CATE estimation, but this assumes that the regression method is correctly specified or in the case of doubly robust estimation (e.g., Farrell, 2015) either the propensity model or the outcome model needs to be correctly specified. In practice neither is likely to be correctly specified. Machine learning approaches generalize regression approaches and can create models that are extremely flexible and predict outcomes accurately for both treatment and control groups (Hahn et al., 2017; Hill, 2011; Chernozhukov et al., 2016). However this loses the interpretability inherent to almost-exact matches. In practice, MALTS performs similarly to (or better than) several machine learning methods in our experiments, despite being restricted to interpretable almost-exact matches. A flexible setup for producing high quality matches is the optimal matching literature (Rosenbaum, 2016), which started with network flow algorithms and has evolved to use integer programming to produce matches that are constrained in user-defined ways (Zubizarreta, 2012; Zubizarreta et al., 2014; Keele and Zubizarreta, 2014; Resa and Zubizarreta, 2016; Noor-E-Alam and Rudin, 2015b,a; Kallus, 2017). In all of these approaches, the user defines the distance metric rather than learning it through data, which is time consuming and likely inaccurate, leading potentially to poor quality of the matched groups. Further, integer programming methods do not scale to problems of the sizes we consider in this manuscript. An alternative to optimal matching is coarsened exact matching (Iacus et al., 2012), an approach that requires users to specify explicit bins for all covariates on which to construct almost-exact matches. This requires users to know in advance which high-dimensional bins the outcome will be insensitive to, which is essentially equivalent to requiring the user to know the answer to the problem we investigate in this work. 2

3 Large amounts of user choice to define these bins can also lead to unintentional user bias. By training the stretching rather than asking the user to define it as in CEM, this bias is reduced. 3 Learning-to-Match Framework Throughout we will write yi T (x i) for the treated potential outcome of an individual with observed pretreatment covariates x i and yi C(x i) for the control potential outcome. When notationally obvious we simplify these to y T, y C, and x. We assume that there are no unobserved confounders and that the stanadrd ignorability assumptions hold (Rubin, 2005). For each individual we define their conditional treatment effect as yi T (x i) yi C(x i). In our framework, a learning-to-match algorithm consist of three modeling decisions: the form of distance metric used for matching, the method of learning parameters of that distance metric, and the method of matching. A training set is used to train the parameters of the distance metric, and that learned distance metric is used on the rest of the sample in the test-set for matching and CATE prediction. Formally, a distance function is a map distance(x i, x k ) R that is symmetric, positive definite, and obeys the triangle inequality, where covariates x i are in R p. Our goal is to minimize the expected loss between estimated treatment effects TE(x) and true treatment effects TE(x) = E(y T (x) y C (x)) across a target population µ(x, y T, y C ) (this can either be a finite or super-population). We use hat notation ˆ for estimates. Let the expected loss be: E µ l( TE(x), TE(x)) = l( TE(x), TE(x))dµ(x, y T, y C ) µ = l(ŷ T (x) ŷ C (x), y T (x) y C (x))dµ(x, y T, y C ). µ We use squared loss, l(a, b) = (a b) 2. If we had a random i.i.d. sample {x i } i from the marginal distribution of the covariates µ(x), and if we had labels y T (x i ) and y C (x i ) we could then approximate the expected loss by the average loss over the sample: E µ l( TE(x), TE(x) 1 n l(ŷ T (x i ) ŷ C (x i ), y T (x i ) y C (x i )), n i=1 however, the difficulty in causal inference is that we only observe treatment outcomes y T (x i ) or control outcomes y C (x i ) for an individual i, so we cannot directly calculate the treatment effect. Breaking the sum into treatment and control group: E µ l( TE(x), TE(x)) 1 n t i treated l(ŷt (x i ) ŷ C (x i ), y T (x i ) y C (x i )) + 1 n c i control l(ŷt (x i ) ŷ C (x i ), y T (x i ) y C (x i )). If the target population is the treatment group (for estimating conditional average treatment effects on the treated CATT), then the sum over the control group would be removed from the quantity above. Consider one term from the treatment group. We know y T (x i ) and thus do not estimate it, so the term becomes l(y T (x i ) ŷ C (x i ), y T (x i ) y C (x i )). We still do not know y C (x i ), and must estimate it. If we use matching, we will compute the estimated control outcome ŷ C (x i ) by an average of the control outcomes within its matched group that we can observe. Let us define the matched group for unit i in terms of the observed control units indexed by k, MG(i, distance, {x C k } k, {y C k } k ) = {k : distance(x C k, x C i ) < δ}. (1) We allow overlap of matched groups in this notation, thus these are bounded-distance nearest-neighbor matches, using the specified distance metric. Thus, ŷ C (x i ) = 1 K k MG(i,distance,{x C k } k,{yk C} k) yk C, (2) 3

4 where K is the size of the matched group MG(i, distance, {x C k } k, {yk C} k). In order for matching to yield a high quality estimate ŷi C, it is sufficient (but not necessary) for the distance function to have the following property: Smoothness-of-Outcomes-for-Distance Property (SOD). A distance measure has the SOD property if for any observed values of a and b: distance(x a, x b ) < δ y T a y T b < ɛ. (3) In other words, according to the SOD condition, if we consider all units within a small distance of x a, the value of the outcome is almost constant. If this were true for all units, then we automatically have 1 K (yi C ɛ) ŷ C (x i ) 1 K k MG (yi C + ɛ) k MG y C i ɛ ŷ C (x i ) y C i + ɛ. where both summations are over k MG ( i, distance, {x C k } k, {yk C} k). Thus, the SOD condition would suffice to provide us with high quality estimates for counterfactuals, and therefore, low losses on estimates of conditional average treatment effects. Our framework learns a distance metric from a separate training set of data (not the test samples considered in the averages above), so that the distance approximately has the SOD property. Denote this training set by {x T,tr i, y T,tr i } i, {x C,tr i property, we minimize the following: distance arg, y C,tr i min distance } i. To learn distance so that its distance has an approximate SOD ( i treated + i control y tr,t i ) 2 ŷ T (x tr,t i ) ( y tr,c i ŷ C (x tr,c i ) where ŷ C (x tr,c i ) is defined by Equations (1) and (2) including its dependence on the distance, using the training data for creating matched groups, and ŷ T (x tr,t i ) is defined analogously. More generally we can write the step above as: distance arg min distance LearnDistance({xT,tr i, y T,tr i 4 Matching After Learning to Stretch ) 2, } i, {x C,tr i, y C,tr i } i, distance). (4) MALTS is an almost exact matching method for causal inference and treatment estimation designed to work on both experimental and observational datasets. MALTS performs treatment effect estimation using following three stages: 1) distance metric learning, 2) matching samples, and 3) estimating CATEs. Let us start with nearest neighbor matching using a learned weight function that depends on the distances between points. The weights for the weighted nearest neighbors matching can be learned via this specification of Eq (4): 2 W = arg min W + i control i treated y tr,c i y tr,t i j control,i j j treated,i j W i,jy tr,c j W i,jy tr,t j 2. We let W i,j be a function of the distances {distance(x i, x j )} ij. For example, the W i,j can be binary to encode whether j belongs to i s K-nearest neighbors. Alternatively they can encode soft KNN weights where W i,j = e distance(xi,xj). In the remainder of the paper we consider distance functions for W that are parameterized by a set of parameters L. 4

5 Euclidean distance in X space works best for continuous covariates. As such, one might consider distances of the form Lx a Lx b 2 where L codes the orientation of the data. An example of this in the causal inference literature is the classical Mahalanobis distance where L is hardcoded as the inverse covariance matrix for the observed Xs. This approach has been demonstrated to perform well in settings where all covariates are observed and the inferential target is the average treatment effect Stuart (2010). We are also interested in individualized treatment effects and just as the choice of Euclidean norm in Mahalanobis distance matching depends on the estimand of interest, the stretch metric needs to be amended for this new estimand. We propose learning L directly from the observed data rather than setting it up front; this is because we do not know the true distance metric, and humans are not good at creating high dimensional functions in our heads from data. The distance metric can be learned such that the objective with respect to W is minimized. For discrete (categorical/unordered) data, Euclidean distance is not a natural distance metric mixed (ordered+unordered) data poses a different set of challenges than either one alone, given the geometry of the space. Approximate closeness is ill-defined for unordered discrete covariates while exact matching is not usually possible on continuous covariates as it is unlikely for two points to have exactly the same values (assuming a continuous distribution for continuous covariates, a match would occur with almost zero probability). If we use the same distance metric for both discrete and continuous data, then units that are close in ordinal space might be arbitrarily far in categorical space. Because of this, it is not possible to parameterize a single form of distance metric to enforce both exact matching on categorical data and almost exact matching for ordinal data. While Mahalanobis distance matching papers recommend converting unordered categorical variables to binary indicators, this approach does not scale and in fact can introduce an overwhelming number of irrelevant covariates. We propose parametrizing our distance metric in terms of two components: one is a learned weighted Euclidean distance for continuous (or generally ordered) covariates while the other is learned weighted hamming distance for discrete (or generally unordered) covariates. These components are separately parameterized by matrices L c and L d respectively. Let a c, b c denote continuous covariates while a d, b d denote discrete ones. The distance metric we propose to work with is given by: distance L (a, b) = d Lc (a c, b c ) + d Ld (a d, b d ) d Lc (a c, b c ) = L c a c L c b c 2, d Ld (a d, b d ) = L (i,i) d 1[a (i) d b(i) d ], where 1[A] is the indicator that event A occurred. We thus perform almost-exact matching on the unordered covariates and learned-mahalanobis-distance matching on the rest of the covariates. We thus learn L such that we have approximately solved: 2 2 L arg min L i treatment group y tr,t i 1 K k KNN L (T,x i ) y tr,t k + a d i=0 i control group y tr,c i 1 K k KNN L (C,x i ) y tr,c k, where KNN L (B, x i ) is defined as the set of K nearest points to x i using the metric distance(x i, x k ) parameterized by L. For interpretability, we let L c be a diagonal matrix, which allows stretches or reflections of the continuous covariates. This way, the magnitude of an entry in L c or L d provides the relative importance of the indicated covariate for the causal inference problem. For simplicity in the formulation above, we assumed additive separability of ordered and unordered parts. If desired, separate stretch metrics L c can be learned for each discrete bin; a natural extension (not explored here) is to use Bayesian shrinkage to pull the stretch matrices together for the different bins. We use python scipy library s implementation of COBYLA, a non-gradient optimization method to learn the optimal L Jones et al. (01 ); Powell (1994). After learning the distance matrix on the training data, we used the new learned distance metric to predict conditional average treatment effects (CATEs) for each unit in the test-set, using nearest neighbors from the training set. For any given new point x in the p-dimensional covariate space we construct the 5

6 matched group using control set C via KNN L (C, x), and using treatment set T via KNN L (T, x). Estimated CATE for a point x is calculated via KNN L for treatment and control: ĈATE(x) = 1 ( ) y tr,t i y tr,c i. K i KNN L (T,x) i KNN L (C,x) If we were to abandon the goal of an interpretable distance function, the framework trivially generalizes to incorporate complex data structures by introducing a neural network (or other flexible learning) framework for coding the data. That is, we can redefine the distance metric via distance L (x i, x j ) = φ L (x i ), φ L (x j ) or distance L (x i, x j ) = (φ L (x i ) φ L (x j )) 2, where φ is a summary of relevant data features that is learned using a neural network or other complex modeling framework. Because deep neural networks tend to show improvements over other methods only for certain problems where latent spaces need to be constructed (computer vision, speech), we expect that the stretch/almost-exact match combination should suffice for most datasets. 5 Experiments In this section, we discuss the the performance of MALTS on both synthetically generated datasets (ordinal covariates and mixed covariates) and real world datasets. 5.1 Ordinal Covariates A variety of experimental settings were designed where the covariates were sampled from a continuous distribution. We analyzed MALTS by studying error rates in prediction of CATEs, analyzing the matched groups produced by MALTS, and pattern of change in error rate as number of covariates and number of units in dataset changes. The data generation process used for testing MALTS on continuous covariates includes quadratic treatment effect terms in addition to a linear treatment effect and linear baseline effect, as follows: Data generative process-1 (with independent covariates): x N (µ, Σ), µ = [3, 3.., 3] p 1, Σ = 0.5 I(p p) p(a i = 1) = p(a i = 1) = 1 2, β i = a i 10 2 i, α i N (1, 0.5) k k k y = β i x i + T α i x i + T x i x j. i=1 i=1 Data generative process-2 (with correlated covariates): i,j=1,j>i x N (µ, Σ), µ = [3, 3.., 3] p 1, Σ = 0.5 Σ = (1 ρ)i + ρ11 T, ρ = 0.2 p(a i = 1) = p(a i = 1) = 1 2, β i = a i 10 2 i, α i N (1, 0.5) k k k y = β i x i + T α i x i + T x i x j. i=1 i=1 i,j=1,j>i In these two DGPs we generate p covariates, with k of them contributing to the outcome (that is, there are p k irrelevant covariates). 6

7 Figure 1: Variance within matched groups tends to be smaller for important covariates. Average variance within matched groups for each covariate during the testing phase of MALTS. Variables 1, 2, and 3 are important and thus are matched more closely by MALTS, leading to lower variances within groups Variance within the Matched groups We generated 10 different training sets where p = k = 10 and n c = n t = We test the method on 10 different test datasets with n test = Recall that during the training phase, MALTS learns the distance metric in a way that stretches the more relevant covariates while compressing the irrelevant covariates in order to better predict the outcome. Because the β s of the true model are exponentially decreasing in absolute value, MALTS should ideally learn a distance metric with the largest stretch for the first covariate, second largest stretch for the second one, and so on. This ensures that the units within a matched group will be closest on the covariates that affect the outcome the most. A natural measure of covariate balance in the test set is the variability of a covariate within a matched group. After MALTS we expect that the average variance of the first covariate dimension within matched groups is the lowest, followed by the second, followed by the third, until prediction uncertainty overwhelms the (diminishing) importance of covariates. For this collection of datasets, Figure 1 plots the average variances for each covariate. We note that the variance stops increasing beyond the fourth covariate that is the matching mechanism does not necessarily distinguish between the importance of covariates above index four. As the contribution of these covariates to the outcome is less than 10% of the magnitude of the first covariate this suggests that we are likely learning an appropriate distance metric Number of covariates and size of training set In this simulation we consider the performance of our distance-metric-learning task as a function of the size of the training set and number of covariates. Using the same simulation setup as above we increase p = k, n c, and n t to study this behavior. In this simulation the ATE is given by 4.5p 2 1.5p and Figure 2 plots the average CATE absolute error for different training set sizes. For a fixed number of covariates we observe that increasing the size of the training set can reduce the absolute error in estimating the CATEs by nearly 50%. Increasing the number of covariates makes the inferrential task substantially more complex thus leading to an increase in absolute error. However, relative to the increasing size of the actual CATEs (which are on the order of the ATE), we actually see a reduction in relative error. 7

8 Figure 2: Amount of data to achieve low error depends on the number of covariates. Variation of error rate vs increasing number of covariates for datasets of different sizes Error-rate analysis Figure 3: MALTS performs well with respect to other methods. Violin plots of CATE Absolute Error on test set for a variety of methods. MALTS performs well, despite being a matching method. (a) The test data have independent covariates. (b) The test data have covariance matrix Σ = (1 ρ)i + ρ11 T where ρ = 0.2. (c) Frequency plot of CATE absolute error for results shown in the middle figure. More detailed results on the distributions are in Figure 5 and 6. In this simulation we compare MALTS with several other methods: BARTChipman et al. (2010), CRF Athey et al. (2015), difference of random forests, GenMatchDiamond and Sekhon (2013) and Propensity Score Nearest Neighbor matchingross et al. (2015). We study two different settings: one with uncorrelated and one with correlated covariates. Uncorrelated Xs : The training set contains n c = 1000 control units and n t = 1000 treated units. There are p = 10 total covariates observed, but only k = 8 associated with the outcome. Correlated Xs : The training set contains n c = n t = 500 with p = 18 total covariates observed but only k = 8 associated with the outcome. In both settings, the test set has n test = 5000 units. 8

Figure 4: Tighter matched groups are higher quality. Scatter plot of CATE estimates for different matched groups versus the reciprocal of the diameter of the matched group.

9 Figure 4: Tighter matched groups are higher quality. Scatter plot of CATE estimates for different matched groups versus the reciprocal of the diameter of the matched group. Dashed curve is a fitted exponential. One may choose to prune matched groups based on diameter to remove less reliable large diameter groups. Figure 3 compares the errors in estimating CATEs in both settings using the different methods. We note that the other matching methods are not designed for CATE estimation and so perform very poorly. On the other hand we see that MALTS performs on the order of the best modeling method, BART. Figure 4(a) is based on the the uncorrelated simulation and plots the reciprocal of the diameter of each matched group (where diameter is defined as the maximum distance of matched samples to the query sample in a matched group) versus the absolute CATE error. We note that tighter groups are of higher quality and lead to better estimation of CATEs. We can threshold and remove bad quality matched groups as function of diameter or we can weight the matched group as a function of diameter in estimation of estimands. Figures 5 and 6 provide a detailed view of the performance of the different methods. Of the interpretable matching methods it is clear that MALTS performs the best. It also outperforms less interpretable black-box methods like CRF and different of random forests. We also note that when covariates are correlated, the performance of MALTS improves even though the number of training samples is halved while the number of covariates is almost doubled. 5.2 Mixed Covariates We test MALTS on mixed data using the same the following data generative process: x c N (µ, Σ), µ = [1, 1.., 1] pc 1, Σ = 0.5 I(p c p c ) x d1,...,dpd iid Bernoulli(1/2) We consider p c = 10 continuous covariates where k c = 4 are associated with the outcome. We also consider p d = 45 discrete covariates where k c = 5 are associated with the outcome. Letting x represent the 9 important covariates and letting k = k c + k d = 9 we generate the outcomes y according to the following quadratic model: k k k y = β i x i + T α i x i + T x i x j where i=1 i=1 i,j=1,j>i p(a i = 1) = p(a i = 1) = 1 2, β i = N (a i 10, 1), α i N (1, 0.5). 9

10 (a) MALTS (b) Genmatch (c) Nearest Neighbor Matching (d) Difference of two BARTs (e) Causal Forest (f) Difference of two RF Figure 5: Another view of performance of different methods, more detailed than Figure 3 (a). Scatter plots for true value of CATEs vs predicted CATEs for different methods where all the covariates are independent. Figure 7 provides a summary of the errors in estimating CATEs across the different methods. We note that MALTS continues to perform as well as methods that directly model the outcome while outperforming the more interpretable matching methods. Figure 8 provides an in depth view of these approaches. In particular, we note that MALTS is better able to adapt to the unimportant discrete covariates than both CRF and the different of two random forests. 5.3 Real Dataset: Lalonde In this section we study the performance of MALTS on the classical Lalonde dataset. The data describe the National Support Work Demonstration (NSW) temporary employment program and its effect on income level of the participants LaLonde (1986). This dataset is frequently used as a benchmark for the performance of methods for observational causal inference. We employ the male sub-sample from the NSW in our analysis as well as the PSID-2 control sample (of male household heads under age 55 who did not classify themselves as retired in 1975 and who were not working when surveyed in the spring of 1976) Dehejia and Wahba (1999b). The outcome variable for both experimental and observational analyses is earnings in 1978 and the considered variables are age, education, whether a respondent is black, is Hispanic, is married, has a degree, and their earnings in 1975 Dehejia and Wahba (1999a). Previously, it has been demonstrated that almost any adjustment during the analysis of the experimental and observational variants of these data (both by modeling the outcome and by modeling the treatment variable) can lead to extreme bias in the estimate of average treatment effects. Tables 1 and 2 present the average treatment effect estimates based on MALTS and state of the art modeling and matching methods. We note that MALTS (after thresholding for high diameter matched group according to the procedure described in Section 5.1.3) is able to achieve 10

11 (a) MALTS (b) Genmatch (c) Nearest Neighbor Matching (d) Difference of two BARTs (e) Causal Forest (f) Difference of two RF Figure 6: Another view of performance of different methods, more detailed than Figure 3 (b). Scatter plots for true value of CATEs vs predicted CATEs for different methods where covariance matrix Σ = (1 ρ)i +ρ11 T where ρ = 0.2. accurate ATE estimation based on both experimental and observational datasets while not discarding too many individuals. As MALTS is a method for constructing interpretable matched groups (which allow for proper CATE estimation) we present sample output matched groups of MALTS for the observational Lalonde dataset in Table 3. At the top of the table we present the learned stretches for the distance metric (L) and note that matching on age appears to be extremely important, followed by education and income in We present two individuals for whom we want to construct matched groups: Query 1 is an 18 year old individual with no income in We are able to construct a tight matched group for this individual (both in control and in treatment). Query 2 is a 17-year-old high income individual, which is an extremely unlikely scenario, leading to a matched group with a very large diameter, which should probably not be used during analysis. 11

12 Figure 7: MALTS performs well on mixed-data. Violin plots of CATE Absolute Error on test set for variety of methods on dataset with mixed and independent covariates. (a) MALTS (b) Genmatch (c) Nearest Neighbor Matching (d) Difference of two BARTs (e) Causal Forest (f) Difference of two RF Figure 8: Scatter plots for true value of CATEs vs predicted CATEs for different methods where all the covariates are independent 12

13 Table 1: Predicted ATE for different methods on full Lalonde s experimental dataset. MALTS uses e 0.15 diameter(matched group) as the weights for each matched group in weighted estimation of ATE. In this setting NN matching is done with replacement. Method ATE Estimate Estimation Bias (%) Number of units matched Truth MALTS (Weighted) CRF BART Genmatch Nearest Neighbor Table 2: Predicted ATE for different methods on Lalonde s observation PSID-2 control dataset and experimental treatment dataset. MALTS uses e diameter(matched group) as the weights for each matched group in weighted estimation of ATE. In this setting NN matching is done with replacement. Method ATE Estimate Estimation Bias (%) Number of units matched Truth MALTS (Weighted) CRF BART Genmatch Nearest Neighbor Figure 9: Scatter Plot showing variation of CATEs with reciprocal of diameter (the surrogate for quality of matched groups) for Lalonde observational PSID-2 dataset. The red line shows an abrupt change in variance of predicted CATEs. The red line can also be used as a threshold to prune bad quality matches to estimate ATE. 13

14 Table 3: Example of Matched Groups on Lalonde Experimental treatment and Observational control Datasets for two example query points drawn from same datasets. Query 1 represents a high quality (low diameter) matched group while Query 2 represents a poor quality (high diameter) matched group that should be discarded during analysis. Age Education Black Hispanic Married Nodegree re75 L Age Education Black Hispanic Married Nodegree re75 re78 T Query Mean Query Mean

15 6 Conclusion and Discussion This paper introduced MALTS, a new method for causal inference and treatment effect estimation. MALTS learns a metric on the covariate space that puts more weight on important covariates and so produces high quality matched groups for analysis. MALTS is able to deal with irrelevant covariates by downweighting their importance in the weighted nearest neighbor algorithm that produces the matches. Unlike other competitive black-box methods including BART, Causal Forest or difference of two random forests, MALTS produces interpretable matched groups and also returns the stretch matrix highlighting relevant covariates. A natural extension to the introduced MALTS framework is to use neural networks or support vector machines to learn a flexible distance metric in a latent space, thus allowing us to match on images and text documents. This setup can be extended to studying treatment effect in texts or images. The MALTS framework can further be extended to deal with missing covariates, and can be adapted to instrumental variables, which is an ongoing effort. References Athey, S., Eckles, D., and Imbens, G. W. (2015). Exact p-values for network interference. Technical report, National Bureau of Economic Research. Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. K. (2016). Double machine learning for treatment and causal parameters. Technical report, cemmap working paper, Centre for Microdata Methods and Practice. Chipman, H. A., George, E. I., and Mcculloch, R. E. (2010). Bart: Bayesian additive regression trees. Annals of Applied Statistics, pages Cochran, W. G. and Rubin, D. B. (1973). Controlling bias in observational studies: A review. Sankhyā: The Indian Journal of Statistics, Series A, pages Dehejia, R. and Wahba, S. (1999a). Lalonde data. [Online; accessed ]. Dehejia, R. H. and Wahba, S. (1999b). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448): Diamond, A. and Sekhon, J. S. (2013). Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. The Review of Economics and Statistics, 95(3): Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, 189(1):1 23. Hahn, P. R., Murray, J. S., and Carvalho, C. M. (2017). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1): Iacus, S. M., King, G., and Porro, G. (2012). Causal inference without balance checking: Coarsened exact matching. Political analysis, 20(1):1 24. Jones, E., Oliphant, T., Peterson, P., et al. (2001 ). SciPy: Open source scientific tools for Python. [Online; accessed ]. Kallus, N. (2017). A Framework for Optimal Matching for Causal Inference. In Singh, A. and Zhu, J., editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages , Fort Lauderdale, FL, USA. PMLR. 15

16 Keele, L. and Zubizarreta, J. R. (2014). Optimal multilevel matching in clustered observational studies: A case study of the school voucher system in chile. arxiv preprint arxiv: LaLonde, R. J. (1986). Evaluating the Econometric Evaluations of Training Programs with Experimental Data. American Economic Review, 76(4): Noor-E-Alam, M. and Rudin, C. (2015a). experiments. Working paper. Robust nonparametric testing for causal inference in natural Noor-E-Alam, M. and Rudin, C. (2015b). Robust testing for causal inference in natural experiments. Working paper. Powell, M. J. D. (1994). A Direct Search Optimization Method That Models the Objective and Constraint Functions by Linear Interpolation, pages Springer Netherlands, Dordrecht. Resa, M. and Zubizarreta, J. R. (2016). balance. Statistics in Medicine. Evaluation of subset matching methods and forms of covariate Rosenbaum, P. R. (2016). Imposing minimax and quantile constraints on optimal matching in observational studies. Journal of Computational and Graphical Statistics, (just-accepted). Ross, M. E., Kreider, A. R., Huang, Y.-S., Matone, M., Rubin, D. M., and Localio, A. R. (2015). Propensity score methods for analyzing observational data like randomized experiments: challenges and solutions for rare outcomes and exposures. American journal of epidemiology, 181(12): Rubin, D. B. (1973a). Matching to remove bias in observational studies. Biometrics, pages Rubin, D. B. (1973b). The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics, pages Rubin, D. B. (1976). Multivariate matching methods that are equal percent bias reducing, i: Some examples. Biometrics, pages Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100: Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statist. Sci., 25(1):1 21. Wang, T., Roy, S., Rudin, C., and Volfovsky, A. (2017). Flame: A fast large-scale almost matching exactly approach to causal inference. arxiv preprint arxiv: Zubizarreta, J. R. (2012). Using mixed integer programming for matching in an observational study of kidney failure after surgery. Journal of the American Statistical Association, 107(500): Zubizarreta, J. R., Paredes, R. D., and Rosenbaum, P. R. (2014). Matching for balance, pairing for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit high schools in chile. The Annals of Applied Statistics, 8(1):

Achieving Optimal Covariate Balance Under General Treatment Regimes

Achieving Optimal Covariate Balance Under General Treatment Regimes Achieving Under General Treatment Regimes Marc Ratkovic Princeton University May 24, 2012 Motivation For many questions of interest in the social sciences, experiments are not possible Possible bias in