arxiv: v1 [stat.me] 18 Nov 2018

Size: px
Start display at page:

Download "arxiv: v1 [stat.me] 18 Nov 2018"

Transcription

1 MALTS: Matching After Learning to Stretch Harsh Parikh 1, Cynthia Rudin 1, and Alexander Volfovsky 2 1 Computer Science, Duke University 2 Statistical Science, Duke University arxiv: v1 [stat.me] 18 Nov 2018 November 20, 2018 Abstract We introduce a flexible framework for matching in causal inference that produces high quality almostexact matches. Most prior work in matching uses ad hoc distance metrics, often leading to poor quality matches, particularly when there are irrelevant covariates that degrade the distance metric. In this work, we learn an interpretable distance metric used for matching, which leads to substantially higher quality matches. The distance metric can stretch continuous covariates and matches exactly on categorical covariates. The framework is flexible in that the user can choose the form of distance metric, the type of optimization algorithm, and the type of relaxation for matching. Our ability to learn flexible distance metrics leads to matches that are interpretable and useful for estimation of conditional average treatment effects. 1 Introduction Matching methods are pervasively used throughout the social and health sciences to make causal conclusions where access to randomized control trial is scarce, but when observational data are widely available. Matching methods find groups of similar individuals, some of whom select into treatment and some of whom select into control. As such, matching methods are interpretable in that they allow fine-grained troubleshooting of the data; for instance, examining the matched groups may allow the user to locate unmeasured confounders that lead units to have a higher chance of being treated. Matching is also powerful in that it can allow the user to correctly estimate highly nonlinear treatment effects but this is only true when the matches are of high quality. The quality of the matches is our main consideration in this work. Typically, matching methods place units into the same matched group when they are close together, where closeness is measured in terms of a pre-defined distance (e.g., exact, coarsened exact, Euclidean, etc.), while maintaining balance constraints between treatment and control units. Despite its merits, this classical paradigm has serious flaws, namely that it relies heavily on a properly specified distance metric. The distance metric cannot be determined without an understanding of the importance of the variables; for instance, the quality of any prespecified distance metric that weighs all covariates equally will degrade as the number of irrelevant covariates increases. This is true no matter which matching methodology is used as the distance metric calculations could be essentially determined by these irrelevant covariates. This has previously been referred to as the toenail problem (Wang et al., 2017), where the weighing of irrelevant covariates (like toenail length) can overwhelm the metric for matching. A separate concern is that the covariates may be scaled differently, where a given distance along one covariate has a different impact than the same distance along a different covariate; in this case, the total distance metric can easily be determined by less relevant covariates, again leading to lower quality matched groups. Ideally, the distance metric would stretch to capture important covariates so that after matching, estimates of treatment effects within the matched groups would be accurate. If the researcher knows how to 1

2 choose the distance metric so that it leads to accurate treatment effect estimates, this would solve the problem. However there is no reason to believe that this is achievable in high-dimensional settings. Producing high dimensional functions to characterize data is a task for which humans alone are not naturally adept. In this work, we propose a framework for matching where an interpretable distance measure between matched units is learned from a held-out training set. As long as the distance metric generalizes from the training set to the full sample, we are able to compute high-quality matches and accurate estimates of treatment effects (conditional average treatment effects CATEs) within the matched groups. One can use any form of distance metric to train, and in this work we focus on exact matching for discrete variables and generalized Mahalanobis distances for continuous variables. By definition, the generalized Mahalanobis distance is determined by a matrix. If the matrix is diagonal, the distance calculation is a stretch for each covariate. Irrelevant covariates will be compressed so that their values are effectively always zero. Highly relevant covariates will be stretched, so that to be considered a match, the values of that covariate must be very similar. In this way, diagonal matrices lead to very interpretable distance metrics. If the Mahalanobis distance matrix is not constrained to be diagonal, then it induces a stretch and rotation, leading to more flexible but less interpretable notions of distance. The new framework is called Learning-to-Match, and the algorithm introduced in this work is called Matching After Learning to Stretch (MALTS). We tested MALTS against several other matching methods in simulation studies, where ground truth is known, and here, MALTS achieves substantially and consistently better results than other methods including Genmatch, propensity score matching, standard Mahalanobis distance and coarsened exact matching. Even though our method is heavily constrained to produce interpretable matches, it performs at the same level as non-matching methods that are designed to fit extremely flexible but uninterpretable models directly to the response surface. 2 Related work Since the 1970 s, the causal inference literature on matching methods has concentrated on dimension reduction techniques (e.g., Rubin, 1973a,b, 1976; Cochran and Rubin, 1973). The leading dimension reduction approach in this literature uses the propensity score, the conditional probability of treatment given covariate information. Propensity score methods target average treatment effects and so do not produce exact matches or almost-exact matches. When treatment is binary, they project data onto one dimension, and closeness of units in propensity score does not imply their closeness in covariate space. As a result, the matches cannot directly be used for estimating heterogeneous treatment effects. Regression methods can be used for CATE estimation, but this assumes that the regression method is correctly specified or in the case of doubly robust estimation (e.g., Farrell, 2015) either the propensity model or the outcome model needs to be correctly specified. In practice neither is likely to be correctly specified. Machine learning approaches generalize regression approaches and can create models that are extremely flexible and predict outcomes accurately for both treatment and control groups (Hahn et al., 2017; Hill, 2011; Chernozhukov et al., 2016). However this loses the interpretability inherent to almost-exact matches. In practice, MALTS performs similarly to (or better than) several machine learning methods in our experiments, despite being restricted to interpretable almost-exact matches. A flexible setup for producing high quality matches is the optimal matching literature (Rosenbaum, 2016), which started with network flow algorithms and has evolved to use integer programming to produce matches that are constrained in user-defined ways (Zubizarreta, 2012; Zubizarreta et al., 2014; Keele and Zubizarreta, 2014; Resa and Zubizarreta, 2016; Noor-E-Alam and Rudin, 2015b,a; Kallus, 2017). In all of these approaches, the user defines the distance metric rather than learning it through data, which is time consuming and likely inaccurate, leading potentially to poor quality of the matched groups. Further, integer programming methods do not scale to problems of the sizes we consider in this manuscript. An alternative to optimal matching is coarsened exact matching (Iacus et al., 2012), an approach that requires users to specify explicit bins for all covariates on which to construct almost-exact matches. This requires users to know in advance which high-dimensional bins the outcome will be insensitive to, which is essentially equivalent to requiring the user to know the answer to the problem we investigate in this work. 2

3 Large amounts of user choice to define these bins can also lead to unintentional user bias. By training the stretching rather than asking the user to define it as in CEM, this bias is reduced. 3 Learning-to-Match Framework Throughout we will write yi T (x i) for the treated potential outcome of an individual with observed pretreatment covariates x i and yi C(x i) for the control potential outcome. When notationally obvious we simplify these to y T, y C, and x. We assume that there are no unobserved confounders and that the stanadrd ignorability assumptions hold (Rubin, 2005). For each individual we define their conditional treatment effect as yi T (x i) yi C(x i). In our framework, a learning-to-match algorithm consist of three modeling decisions: the form of distance metric used for matching, the method of learning parameters of that distance metric, and the method of matching. A training set is used to train the parameters of the distance metric, and that learned distance metric is used on the rest of the sample in the test-set for matching and CATE prediction. Formally, a distance function is a map distance(x i, x k ) R that is symmetric, positive definite, and obeys the triangle inequality, where covariates x i are in R p. Our goal is to minimize the expected loss between estimated treatment effects TE(x) and true treatment effects TE(x) = E(y T (x) y C (x)) across a target population µ(x, y T, y C ) (this can either be a finite or super-population). We use hat notation ˆ for estimates. Let the expected loss be: E µ l( TE(x), TE(x)) = l( TE(x), TE(x))dµ(x, y T, y C ) µ = l(ŷ T (x) ŷ C (x), y T (x) y C (x))dµ(x, y T, y C ). µ We use squared loss, l(a, b) = (a b) 2. If we had a random i.i.d. sample {x i } i from the marginal distribution of the covariates µ(x), and if we had labels y T (x i ) and y C (x i ) we could then approximate the expected loss by the average loss over the sample: E µ l( TE(x), TE(x) 1 n l(ŷ T (x i ) ŷ C (x i ), y T (x i ) y C (x i )), n i=1 however, the difficulty in causal inference is that we only observe treatment outcomes y T (x i ) or control outcomes y C (x i ) for an individual i, so we cannot directly calculate the treatment effect. Breaking the sum into treatment and control group: E µ l( TE(x), TE(x)) 1 n t i treated l(ŷt (x i ) ŷ C (x i ), y T (x i ) y C (x i )) + 1 n c i control l(ŷt (x i ) ŷ C (x i ), y T (x i ) y C (x i )). If the target population is the treatment group (for estimating conditional average treatment effects on the treated CATT), then the sum over the control group would be removed from the quantity above. Consider one term from the treatment group. We know y T (x i ) and thus do not estimate it, so the term becomes l(y T (x i ) ŷ C (x i ), y T (x i ) y C (x i )). We still do not know y C (x i ), and must estimate it. If we use matching, we will compute the estimated control outcome ŷ C (x i ) by an average of the control outcomes within its matched group that we can observe. Let us define the matched group for unit i in terms of the observed control units indexed by k, MG(i, distance, {x C k } k, {y C k } k ) = {k : distance(x C k, x C i ) < δ}. (1) We allow overlap of matched groups in this notation, thus these are bounded-distance nearest-neighbor matches, using the specified distance metric. Thus, ŷ C (x i ) = 1 K k MG(i,distance,{x C k } k,{yk C} k) yk C, (2) 3

4 where K is the size of the matched group MG(i, distance, {x C k } k, {yk C} k). In order for matching to yield a high quality estimate ŷi C, it is sufficient (but not necessary) for the distance function to have the following property: Smoothness-of-Outcomes-for-Distance Property (SOD). A distance measure has the SOD property if for any observed values of a and b: distance(x a, x b ) < δ y T a y T b < ɛ. (3) In other words, according to the SOD condition, if we consider all units within a small distance of x a, the value of the outcome is almost constant. If this were true for all units, then we automatically have 1 K (yi C ɛ) ŷ C (x i ) 1 K k MG (yi C + ɛ) k MG y C i ɛ ŷ C (x i ) y C i + ɛ. where both summations are over k MG ( i, distance, {x C k } k, {yk C} k). Thus, the SOD condition would suffice to provide us with high quality estimates for counterfactuals, and therefore, low losses on estimates of conditional average treatment effects. Our framework learns a distance metric from a separate training set of data (not the test samples considered in the averages above), so that the distance approximately has the SOD property. Denote this training set by {x T,tr i, y T,tr i } i, {x C,tr i property, we minimize the following: distance arg, y C,tr i min distance } i. To learn distance so that its distance has an approximate SOD ( i treated + i control y tr,t i ) 2 ŷ T (x tr,t i ) ( y tr,c i ŷ C (x tr,c i ) where ŷ C (x tr,c i ) is defined by Equations (1) and (2) including its dependence on the distance, using the training data for creating matched groups, and ŷ T (x tr,t i ) is defined analogously. More generally we can write the step above as: distance arg min distance LearnDistance({xT,tr i, y T,tr i 4 Matching After Learning to Stretch ) 2, } i, {x C,tr i, y C,tr i } i, distance). (4) MALTS is an almost exact matching method for causal inference and treatment estimation designed to work on both experimental and observational datasets. MALTS performs treatment effect estimation using following three stages: 1) distance metric learning, 2) matching samples, and 3) estimating CATEs. Let us start with nearest neighbor matching using a learned weight function that depends on the distances between points. The weights for the weighted nearest neighbors matching can be learned via this specification of Eq (4): 2 W = arg min W + i control i treated y tr,c i y tr,t i j control,i j j treated,i j W i,jy tr,c j W i,jy tr,t j 2. We let W i,j be a function of the distances {distance(x i, x j )} ij. For example, the W i,j can be binary to encode whether j belongs to i s K-nearest neighbors. Alternatively they can encode soft KNN weights where W i,j = e distance(xi,xj). In the remainder of the paper we consider distance functions for W that are parameterized by a set of parameters L. 4

5 Euclidean distance in X space works best for continuous covariates. As such, one might consider distances of the form Lx a Lx b 2 where L codes the orientation of the data. An example of this in the causal inference literature is the classical Mahalanobis distance where L is hardcoded as the inverse covariance matrix for the observed Xs. This approach has been demonstrated to perform well in settings where all covariates are observed and the inferential target is the average treatment effect Stuart (2010). We are also interested in individualized treatment effects and just as the choice of Euclidean norm in Mahalanobis distance matching depends on the estimand of interest, the stretch metric needs to be amended for this new estimand. We propose learning L directly from the observed data rather than setting it up front; this is because we do not know the true distance metric, and humans are not good at creating high dimensional functions in our heads from data. The distance metric can be learned such that the objective with respect to W is minimized. For discrete (categorical/unordered) data, Euclidean distance is not a natural distance metric mixed (ordered+unordered) data poses a different set of challenges than either one alone, given the geometry of the space. Approximate closeness is ill-defined for unordered discrete covariates while exact matching is not usually possible on continuous covariates as it is unlikely for two points to have exactly the same values (assuming a continuous distribution for continuous covariates, a match would occur with almost zero probability). If we use the same distance metric for both discrete and continuous data, then units that are close in ordinal space might be arbitrarily far in categorical space. Because of this, it is not possible to parameterize a single form of distance metric to enforce both exact matching on categorical data and almost exact matching for ordinal data. While Mahalanobis distance matching papers recommend converting unordered categorical variables to binary indicators, this approach does not scale and in fact can introduce an overwhelming number of irrelevant covariates. We propose parametrizing our distance metric in terms of two components: one is a learned weighted Euclidean distance for continuous (or generally ordered) covariates while the other is learned weighted hamming distance for discrete (or generally unordered) covariates. These components are separately parameterized by matrices L c and L d respectively. Let a c, b c denote continuous covariates while a d, b d denote discrete ones. The distance metric we propose to work with is given by: distance L (a, b) = d Lc (a c, b c ) + d Ld (a d, b d ) d Lc (a c, b c ) = L c a c L c b c 2, d Ld (a d, b d ) = L (i,i) d 1[a (i) d b(i) d ], where 1[A] is the indicator that event A occurred. We thus perform almost-exact matching on the unordered covariates and learned-mahalanobis-distance matching on the rest of the covariates. We thus learn L such that we have approximately solved: 2 2 L arg min L i treatment group y tr,t i 1 K k KNN L (T,x i ) y tr,t k + a d i=0 i control group y tr,c i 1 K k KNN L (C,x i ) y tr,c k, where KNN L (B, x i ) is defined as the set of K nearest points to x i using the metric distance(x i, x k ) parameterized by L. For interpretability, we let L c be a diagonal matrix, which allows stretches or reflections of the continuous covariates. This way, the magnitude of an entry in L c or L d provides the relative importance of the indicated covariate for the causal inference problem. For simplicity in the formulation above, we assumed additive separability of ordered and unordered parts. If desired, separate stretch metrics L c can be learned for each discrete bin; a natural extension (not explored here) is to use Bayesian shrinkage to pull the stretch matrices together for the different bins. We use python scipy library s implementation of COBYLA, a non-gradient optimization method to learn the optimal L Jones et al. (01 ); Powell (1994). After learning the distance matrix on the training data, we used the new learned distance metric to predict conditional average treatment effects (CATEs) for each unit in the test-set, using nearest neighbors from the training set. For any given new point x in the p-dimensional covariate space we construct the 5

6 matched group using control set C via KNN L (C, x), and using treatment set T via KNN L (T, x). Estimated CATE for a point x is calculated via KNN L for treatment and control: ĈATE(x) = 1 ( ) y tr,t i y tr,c i. K i KNN L (T,x) i KNN L (C,x) If we were to abandon the goal of an interpretable distance function, the framework trivially generalizes to incorporate complex data structures by introducing a neural network (or other flexible learning) framework for coding the data. That is, we can redefine the distance metric via distance L (x i, x j ) = φ L (x i ), φ L (x j ) or distance L (x i, x j ) = (φ L (x i ) φ L (x j )) 2, where φ is a summary of relevant data features that is learned using a neural network or other complex modeling framework. Because deep neural networks tend to show improvements over other methods only for certain problems where latent spaces need to be constructed (computer vision, speech), we expect that the stretch/almost-exact match combination should suffice for most datasets. 5 Experiments In this section, we discuss the the performance of MALTS on both synthetically generated datasets (ordinal covariates and mixed covariates) and real world datasets. 5.1 Ordinal Covariates A variety of experimental settings were designed where the covariates were sampled from a continuous distribution. We analyzed MALTS by studying error rates in prediction of CATEs, analyzing the matched groups produced by MALTS, and pattern of change in error rate as number of covariates and number of units in dataset changes. The data generation process used for testing MALTS on continuous covariates includes quadratic treatment effect terms in addition to a linear treatment effect and linear baseline effect, as follows: Data generative process-1 (with independent covariates): x N (µ, Σ), µ = [3, 3.., 3] p 1, Σ = 0.5 I(p p) p(a i = 1) = p(a i = 1) = 1 2, β i = a i 10 2 i, α i N (1, 0.5) k k k y = β i x i + T α i x i + T x i x j. i=1 i=1 Data generative process-2 (with correlated covariates): i,j=1,j>i x N (µ, Σ), µ = [3, 3.., 3] p 1, Σ = 0.5 Σ = (1 ρ)i + ρ11 T, ρ = 0.2 p(a i = 1) = p(a i = 1) = 1 2, β i = a i 10 2 i, α i N (1, 0.5) k k k y = β i x i + T α i x i + T x i x j. i=1 i=1 i,j=1,j>i In these two DGPs we generate p covariates, with k of them contributing to the outcome (that is, there are p k irrelevant covariates). 6

7 Figure 1: Variance within matched groups tends to be smaller for important covariates. Average variance within matched groups for each covariate during the testing phase of MALTS. Variables 1, 2, and 3 are important and thus are matched more closely by MALTS, leading to lower variances within groups Variance within the Matched groups We generated 10 different training sets where p = k = 10 and n c = n t = We test the method on 10 different test datasets with n test = Recall that during the training phase, MALTS learns the distance metric in a way that stretches the more relevant covariates while compressing the irrelevant covariates in order to better predict the outcome. Because the β s of the true model are exponentially decreasing in absolute value, MALTS should ideally learn a distance metric with the largest stretch for the first covariate, second largest stretch for the second one, and so on. This ensures that the units within a matched group will be closest on the covariates that affect the outcome the most. A natural measure of covariate balance in the test set is the variability of a covariate within a matched group. After MALTS we expect that the average variance of the first covariate dimension within matched groups is the lowest, followed by the second, followed by the third, until prediction uncertainty overwhelms the (diminishing) importance of covariates. For this collection of datasets, Figure 1 plots the average variances for each covariate. We note that the variance stops increasing beyond the fourth covariate that is the matching mechanism does not necessarily distinguish between the importance of covariates above index four. As the contribution of these covariates to the outcome is less than 10% of the magnitude of the first covariate this suggests that we are likely learning an appropriate distance metric Number of covariates and size of training set In this simulation we consider the performance of our distance-metric-learning task as a function of the size of the training set and number of covariates. Using the same simulation setup as above we increase p = k, n c, and n t to study this behavior. In this simulation the ATE is given by 4.5p 2 1.5p and Figure 2 plots the average CATE absolute error for different training set sizes. For a fixed number of covariates we observe that increasing the size of the training set can reduce the absolute error in estimating the CATEs by nearly 50%. Increasing the number of covariates makes the inferrential task substantially more complex thus leading to an increase in absolute error. However, relative to the increasing size of the actual CATEs (which are on the order of the ATE), we actually see a reduction in relative error. 7

8 Figure 2: Amount of data to achieve low error depends on the number of covariates. Variation of error rate vs increasing number of covariates for datasets of different sizes Error-rate analysis Figure 3: MALTS performs well with respect to other methods. Violin plots of CATE Absolute Error on test set for a variety of methods. MALTS performs well, despite being a matching method. (a) The test data have independent covariates. (b) The test data have covariance matrix Σ = (1 ρ)i + ρ11 T where ρ = 0.2. (c) Frequency plot of CATE absolute error for results shown in the middle figure. More detailed results on the distributions are in Figure 5 and 6. In this simulation we compare MALTS with several other methods: BARTChipman et al. (2010), CRF Athey et al. (2015), difference of random forests, GenMatchDiamond and Sekhon (2013) and Propensity Score Nearest Neighbor matchingross et al. (2015). We study two different settings: one with uncorrelated and one with correlated covariates. Uncorrelated Xs : The training set contains n c = 1000 control units and n t = 1000 treated units. There are p = 10 total covariates observed, but only k = 8 associated with the outcome. Correlated Xs : The training set contains n c = n t = 500 with p = 18 total covariates observed but only k = 8 associated with the outcome. In both settings, the test set has n test = 5000 units. 8

9 Figure 4: Tighter matched groups are higher quality. Scatter plot of CATE estimates for different matched groups versus the reciprocal of the diameter of the matched group. Dashed curve is a fitted exponential. One may choose to prune matched groups based on diameter to remove less reliable large diameter groups. Figure 3 compares the errors in estimating CATEs in both settings using the different methods. We note that the other matching methods are not designed for CATE estimation and so perform very poorly. On the other hand we see that MALTS performs on the order of the best modeling method, BART. Figure 4(a) is based on the the uncorrelated simulation and plots the reciprocal of the diameter of each matched group (where diameter is defined as the maximum distance of matched samples to the query sample in a matched group) versus the absolute CATE error. We note that tighter groups are of higher quality and lead to better estimation of CATEs. We can threshold and remove bad quality matched groups as function of diameter or we can weight the matched group as a function of diameter in estimation of estimands. Figures 5 and 6 provide a detailed view of the performance of the different methods. Of the interpretable matching methods it is clear that MALTS performs the best. It also outperforms less interpretable black-box methods like CRF and different of random forests. We also note that when covariates are correlated, the performance of MALTS improves even though the number of training samples is halved while the number of covariates is almost doubled. 5.2 Mixed Covariates We test MALTS on mixed data using the same the following data generative process: x c N (µ, Σ), µ = [1, 1.., 1] pc 1, Σ = 0.5 I(p c p c ) x d1,...,dpd iid Bernoulli(1/2) We consider p c = 10 continuous covariates where k c = 4 are associated with the outcome. We also consider p d = 45 discrete covariates where k c = 5 are associated with the outcome. Letting x represent the 9 important covariates and letting k = k c + k d = 9 we generate the outcomes y according to the following quadratic model: k k k y = β i x i + T α i x i + T x i x j where i=1 i=1 i,j=1,j>i p(a i = 1) = p(a i = 1) = 1 2, β i = N (a i 10, 1), α i N (1, 0.5). 9

10 (a) MALTS (b) Genmatch (c) Nearest Neighbor Matching (d) Difference of two BARTs (e) Causal Forest (f) Difference of two RF Figure 5: Another view of performance of different methods, more detailed than Figure 3 (a). Scatter plots for true value of CATEs vs predicted CATEs for different methods where all the covariates are independent. Figure 7 provides a summary of the errors in estimating CATEs across the different methods. We note that MALTS continues to perform as well as methods that directly model the outcome while outperforming the more interpretable matching methods. Figure 8 provides an in depth view of these approaches. In particular, we note that MALTS is better able to adapt to the unimportant discrete covariates than both CRF and the different of two random forests. 5.3 Real Dataset: Lalonde In this section we study the performance of MALTS on the classical Lalonde dataset. The data describe the National Support Work Demonstration (NSW) temporary employment program and its effect on income level of the participants LaLonde (1986). This dataset is frequently used as a benchmark for the performance of methods for observational causal inference. We employ the male sub-sample from the NSW in our analysis as well as the PSID-2 control sample (of male household heads under age 55 who did not classify themselves as retired in 1975 and who were not working when surveyed in the spring of 1976) Dehejia and Wahba (1999b). The outcome variable for both experimental and observational analyses is earnings in 1978 and the considered variables are age, education, whether a respondent is black, is Hispanic, is married, has a degree, and their earnings in 1975 Dehejia and Wahba (1999a). Previously, it has been demonstrated that almost any adjustment during the analysis of the experimental and observational variants of these data (both by modeling the outcome and by modeling the treatment variable) can lead to extreme bias in the estimate of average treatment effects. Tables 1 and 2 present the average treatment effect estimates based on MALTS and state of the art modeling and matching methods. We note that MALTS (after thresholding for high diameter matched group according to the procedure described in Section 5.1.3) is able to achieve 10

11 (a) MALTS (b) Genmatch (c) Nearest Neighbor Matching (d) Difference of two BARTs (e) Causal Forest (f) Difference of two RF Figure 6: Another view of performance of different methods, more detailed than Figure 3 (b). Scatter plots for true value of CATEs vs predicted CATEs for different methods where covariance matrix Σ = (1 ρ)i +ρ11 T where ρ = 0.2. accurate ATE estimation based on both experimental and observational datasets while not discarding too many individuals. As MALTS is a method for constructing interpretable matched groups (which allow for proper CATE estimation) we present sample output matched groups of MALTS for the observational Lalonde dataset in Table 3. At the top of the table we present the learned stretches for the distance metric (L) and note that matching on age appears to be extremely important, followed by education and income in We present two individuals for whom we want to construct matched groups: Query 1 is an 18 year old individual with no income in We are able to construct a tight matched group for this individual (both in control and in treatment). Query 2 is a 17-year-old high income individual, which is an extremely unlikely scenario, leading to a matched group with a very large diameter, which should probably not be used during analysis. 11

12 Figure 7: MALTS performs well on mixed-data. Violin plots of CATE Absolute Error on test set for variety of methods on dataset with mixed and independent covariates. (a) MALTS (b) Genmatch (c) Nearest Neighbor Matching (d) Difference of two BARTs (e) Causal Forest (f) Difference of two RF Figure 8: Scatter plots for true value of CATEs vs predicted CATEs for different methods where all the covariates are independent 12

13 Table 1: Predicted ATE for different methods on full Lalonde s experimental dataset. MALTS uses e 0.15 diameter(matched group) as the weights for each matched group in weighted estimation of ATE. In this setting NN matching is done with replacement. Method ATE Estimate Estimation Bias (%) Number of units matched Truth MALTS (Weighted) CRF BART Genmatch Nearest Neighbor Table 2: Predicted ATE for different methods on Lalonde s observation PSID-2 control dataset and experimental treatment dataset. MALTS uses e diameter(matched group) as the weights for each matched group in weighted estimation of ATE. In this setting NN matching is done with replacement. Method ATE Estimate Estimation Bias (%) Number of units matched Truth MALTS (Weighted) CRF BART Genmatch Nearest Neighbor Figure 9: Scatter Plot showing variation of CATEs with reciprocal of diameter (the surrogate for quality of matched groups) for Lalonde observational PSID-2 dataset. The red line shows an abrupt change in variance of predicted CATEs. The red line can also be used as a threshold to prune bad quality matches to estimate ATE. 13

14 Table 3: Example of Matched Groups on Lalonde Experimental treatment and Observational control Datasets for two example query points drawn from same datasets. Query 1 represents a high quality (low diameter) matched group while Query 2 represents a poor quality (high diameter) matched group that should be discarded during analysis. Age Education Black Hispanic Married Nodegree re75 L Age Education Black Hispanic Married Nodegree re75 re78 T Query Mean Query Mean

15 6 Conclusion and Discussion This paper introduced MALTS, a new method for causal inference and treatment effect estimation. MALTS learns a metric on the covariate space that puts more weight on important covariates and so produces high quality matched groups for analysis. MALTS is able to deal with irrelevant covariates by downweighting their importance in the weighted nearest neighbor algorithm that produces the matches. Unlike other competitive black-box methods including BART, Causal Forest or difference of two random forests, MALTS produces interpretable matched groups and also returns the stretch matrix highlighting relevant covariates. A natural extension to the introduced MALTS framework is to use neural networks or support vector machines to learn a flexible distance metric in a latent space, thus allowing us to match on images and text documents. This setup can be extended to studying treatment effect in texts or images. The MALTS framework can further be extended to deal with missing covariates, and can be adapted to instrumental variables, which is an ongoing effort. References Athey, S., Eckles, D., and Imbens, G. W. (2015). Exact p-values for network interference. Technical report, National Bureau of Economic Research. Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. K. (2016). Double machine learning for treatment and causal parameters. Technical report, cemmap working paper, Centre for Microdata Methods and Practice. Chipman, H. A., George, E. I., and Mcculloch, R. E. (2010). Bart: Bayesian additive regression trees. Annals of Applied Statistics, pages Cochran, W. G. and Rubin, D. B. (1973). Controlling bias in observational studies: A review. Sankhyā: The Indian Journal of Statistics, Series A, pages Dehejia, R. and Wahba, S. (1999a). Lalonde data. [Online; accessed ]. Dehejia, R. H. and Wahba, S. (1999b). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448): Diamond, A. and Sekhon, J. S. (2013). Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. The Review of Economics and Statistics, 95(3): Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, 189(1):1 23. Hahn, P. R., Murray, J. S., and Carvalho, C. M. (2017). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1): Iacus, S. M., King, G., and Porro, G. (2012). Causal inference without balance checking: Coarsened exact matching. Political analysis, 20(1):1 24. Jones, E., Oliphant, T., Peterson, P., et al. (2001 ). SciPy: Open source scientific tools for Python. [Online; accessed ]. Kallus, N. (2017). A Framework for Optimal Matching for Causal Inference. In Singh, A. and Zhu, J., editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages , Fort Lauderdale, FL, USA. PMLR. 15

16 Keele, L. and Zubizarreta, J. R. (2014). Optimal multilevel matching in clustered observational studies: A case study of the school voucher system in chile. arxiv preprint arxiv: LaLonde, R. J. (1986). Evaluating the Econometric Evaluations of Training Programs with Experimental Data. American Economic Review, 76(4): Noor-E-Alam, M. and Rudin, C. (2015a). experiments. Working paper. Robust nonparametric testing for causal inference in natural Noor-E-Alam, M. and Rudin, C. (2015b). Robust testing for causal inference in natural experiments. Working paper. Powell, M. J. D. (1994). A Direct Search Optimization Method That Models the Objective and Constraint Functions by Linear Interpolation, pages Springer Netherlands, Dordrecht. Resa, M. and Zubizarreta, J. R. (2016). balance. Statistics in Medicine. Evaluation of subset matching methods and forms of covariate Rosenbaum, P. R. (2016). Imposing minimax and quantile constraints on optimal matching in observational studies. Journal of Computational and Graphical Statistics, (just-accepted). Ross, M. E., Kreider, A. R., Huang, Y.-S., Matone, M., Rubin, D. M., and Localio, A. R. (2015). Propensity score methods for analyzing observational data like randomized experiments: challenges and solutions for rare outcomes and exposures. American journal of epidemiology, 181(12): Rubin, D. B. (1973a). Matching to remove bias in observational studies. Biometrics, pages Rubin, D. B. (1973b). The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics, pages Rubin, D. B. (1976). Multivariate matching methods that are equal percent bias reducing, i: Some examples. Biometrics, pages Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100: Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statist. Sci., 25(1):1 21. Wang, T., Roy, S., Rudin, C., and Volfovsky, A. (2017). Flame: A fast large-scale almost matching exactly approach to causal inference. arxiv preprint arxiv: Zubizarreta, J. R. (2012). Using mixed integer programming for matching in an observational study of kidney failure after surgery. Journal of the American Statistical Association, 107(500): Zubizarreta, J. R., Paredes, R. D., and Rosenbaum, P. R. (2014). Matching for balance, pairing for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit high schools in chile. The Annals of Applied Statistics, 8(1):

Achieving Optimal Covariate Balance Under General Treatment Regimes

Achieving Optimal Covariate Balance Under General Treatment Regimes Achieving Under General Treatment Regimes Marc Ratkovic Princeton University May 24, 2012 Motivation For many questions of interest in the social sciences, experiments are not possible Possible bias in

More information

Bayesian regression tree models for causal inference: regularization, confounding and heterogeneity

Bayesian regression tree models for causal inference: regularization, confounding and heterogeneity Bayesian regression tree models for causal inference: regularization, confounding and heterogeneity P. Richard Hahn, Jared Murray, and Carlos Carvalho June 22, 2017 The problem setting We want to estimate

More information

Flexible Estimation of Treatment Effect Parameters

Flexible Estimation of Treatment Effect Parameters Flexible Estimation of Treatment Effect Parameters Thomas MaCurdy a and Xiaohong Chen b and Han Hong c Introduction Many empirical studies of program evaluations are complicated by the presence of both

More information

Selection on Observables: Propensity Score Matching.

Selection on Observables: Propensity Score Matching. Selection on Observables: Propensity Score Matching. Department of Economics and Management Irene Brunetti ireneb@ec.unipi.it 24/10/2017 I. Brunetti Labour Economics in an European Perspective 24/10/2017

More information

Bayesian causal forests: dealing with regularization induced confounding and shrinking towards homogeneous effects

Bayesian causal forests: dealing with regularization induced confounding and shrinking towards homogeneous effects Bayesian causal forests: dealing with regularization induced confounding and shrinking towards homogeneous effects P. Richard Hahn, Jared Murray, and Carlos Carvalho July 29, 2018 Regularization induced

More information

Causal Inference Lecture Notes: Causal Inference with Repeated Measures in Observational Studies

Causal Inference Lecture Notes: Causal Inference with Repeated Measures in Observational Studies Causal Inference Lecture Notes: Causal Inference with Repeated Measures in Observational Studies Kosuke Imai Department of Politics Princeton University November 13, 2013 So far, we have essentially assumed

More information

Gov 2002: 5. Matching

Gov 2002: 5. Matching Gov 2002: 5. Matching Matthew Blackwell October 1, 2015 Where are we? Where are we going? Discussed randomized experiments, started talking about observational data. Last week: no unmeasured confounders

More information

What s New in Econometrics. Lecture 1

What s New in Econometrics. Lecture 1 What s New in Econometrics Lecture 1 Estimation of Average Treatment Effects Under Unconfoundedness Guido Imbens NBER Summer Institute, 2007 Outline 1. Introduction 2. Potential Outcomes 3. Estimands and

More information

ESTIMATION OF TREATMENT EFFECTS VIA MATCHING

ESTIMATION OF TREATMENT EFFECTS VIA MATCHING ESTIMATION OF TREATMENT EFFECTS VIA MATCHING AAEC 56 INSTRUCTOR: KLAUS MOELTNER Textbooks: R scripts: Wooldridge (00), Ch.; Greene (0), Ch.9; Angrist and Pischke (00), Ch. 3 mod5s3 General Approach The

More information

DOCUMENTS DE TRAVAIL CEMOI / CEMOI WORKING PAPERS. A SAS macro to estimate Average Treatment Effects with Matching Estimators

DOCUMENTS DE TRAVAIL CEMOI / CEMOI WORKING PAPERS. A SAS macro to estimate Average Treatment Effects with Matching Estimators DOCUMENTS DE TRAVAIL CEMOI / CEMOI WORKING PAPERS A SAS macro to estimate Average Treatment Effects with Matching Estimators Nicolas Moreau 1 http://cemoi.univ-reunion.fr/publications/ Centre d'economie

More information

Matching for Causal Inference Without Balance Checking

Matching for Causal Inference Without Balance Checking Matching for ausal Inference Without Balance hecking Gary King Institute for Quantitative Social Science Harvard University joint work with Stefano M. Iacus (Univ. of Milan) and Giuseppe Porro (Univ. of

More information

Is My Matched Dataset As-If Randomized, More, Or Less? Unifying the Design and. Analysis of Observational Studies

Is My Matched Dataset As-If Randomized, More, Or Less? Unifying the Design and. Analysis of Observational Studies Is My Matched Dataset As-If Randomized, More, Or Less? Unifying the Design and arxiv:1804.08760v1 [stat.me] 23 Apr 2018 Analysis of Observational Studies Zach Branson Department of Statistics, Harvard

More information

Causal Inference Basics

Causal Inference Basics Causal Inference Basics Sam Lendle October 09, 2013 Observed data, question, counterfactuals Observed data: n i.i.d copies of baseline covariates W, treatment A {0, 1}, and outcome Y. O i = (W i, A i,

More information

Variable selection and machine learning methods in causal inference

Variable selection and machine learning methods in causal inference Variable selection and machine learning methods in causal inference Debashis Ghosh Department of Biostatistics and Informatics Colorado School of Public Health Joint work with Yeying Zhu, University of

More information

Dynamics in Social Networks and Causality

Dynamics in Social Networks and Causality Web Science & Technologies University of Koblenz Landau, Germany Dynamics in Social Networks and Causality JProf. Dr. University Koblenz Landau GESIS Leibniz Institute for the Social Sciences Last Time:

More information

finite-sample optimal estimation and inference on average treatment effects under unconfoundedness

finite-sample optimal estimation and inference on average treatment effects under unconfoundedness finite-sample optimal estimation and inference on average treatment effects under unconfoundedness Timothy Armstrong (Yale University) Michal Kolesár (Princeton University) September 2017 Introduction

More information

A Measure of Robustness to Misspecification

A Measure of Robustness to Misspecification A Measure of Robustness to Misspecification Susan Athey Guido W. Imbens December 2014 Graduate School of Business, Stanford University, and NBER. Electronic correspondence: athey@stanford.edu. Graduate

More information

arxiv: v4 [stat.ml] 31 Jan 2019

arxiv: v4 [stat.ml] 31 Jan 2019 FLAME: A Fast Large-scale Almost Matching Exactly Approach to Causal Inference Tianyu Wang Sudeepa Roy Cynthia Rudin Alexander Volfovsky arxiv:1707.06315v4 [stat.ml] 31 Jan 2019 Abstract A classical problem

More information

Bayesian Causal Forests

Bayesian Causal Forests Bayesian Causal Forests for estimating personalized treatment effects P. Richard Hahn, Jared Murray, and Carlos Carvalho December 8, 2016 Overview We consider observational data, assuming conditional unconfoundedness,

More information

A Minimax Surrogate Loss Approach to Causal Inference

A Minimax Surrogate Loss Approach to Causal Inference A Minimax Surrogate Loss Approach to Causal Inference Siong Thye Goh MIT Cynthia Rudin Duke University February 10, 2018 Abstract We present a new machine learning approach to estimate personalized treatment

More information

Propensity Score Analysis with Hierarchical Data

Propensity Score Analysis with Hierarchical Data Propensity Score Analysis with Hierarchical Data Fan Li Alan Zaslavsky Mary Beth Landrum Department of Health Care Policy Harvard Medical School May 19, 2008 Introduction Population-based observational

More information

An Introduction to Causal Mediation Analysis. Xu Qin University of Chicago Presented at the Central Iowa R User Group Meetup Aug 10, 2016

An Introduction to Causal Mediation Analysis. Xu Qin University of Chicago Presented at the Central Iowa R User Group Meetup Aug 10, 2016 An Introduction to Causal Mediation Analysis Xu Qin University of Chicago Presented at the Central Iowa R User Group Meetup Aug 10, 2016 1 Causality In the applications of statistics, many central questions

More information

Propensity Score Matching

Propensity Score Matching Methods James H. Steiger Department of Psychology and Human Development Vanderbilt University Regression Modeling, 2009 Methods 1 Introduction 2 3 4 Introduction Why Match? 5 Definition Methods and In

More information

Propensity Score Weighting with Multilevel Data

Propensity Score Weighting with Multilevel Data Propensity Score Weighting with Multilevel Data Fan Li Department of Statistical Science Duke University October 25, 2012 Joint work with Alan Zaslavsky and Mary Beth Landrum Introduction In comparative

More information

Imbens/Wooldridge, IRP Lecture Notes 2, August 08 1

Imbens/Wooldridge, IRP Lecture Notes 2, August 08 1 Imbens/Wooldridge, IRP Lecture Notes 2, August 08 IRP Lectures Madison, WI, August 2008 Lecture 2, Monday, Aug 4th, 0.00-.00am Estimation of Average Treatment Effects Under Unconfoundedness, Part II. Introduction

More information

Statistical Models for Causal Analysis

Statistical Models for Causal Analysis Statistical Models for Causal Analysis Teppei Yamamoto Keio University Introduction to Causal Inference Spring 2016 Three Modes of Statistical Inference 1. Descriptive Inference: summarizing and exploring

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Causal inference in multilevel data structures:

Causal inference in multilevel data structures: Causal inference in multilevel data structures: Discussion of papers by Li and Imai Jennifer Hill May 19 th, 2008 Li paper Strengths Area that needs attention! With regard to propensity score strategies

More information

Controlling for overlap in matching

Controlling for overlap in matching Working Papers No. 10/2013 (95) PAWEŁ STRAWIŃSKI Controlling for overlap in matching Warsaw 2013 Controlling for overlap in matching PAWEŁ STRAWIŃSKI Faculty of Economic Sciences, University of Warsaw

More information

An Introduction to Causal Analysis on Observational Data using Propensity Scores

An Introduction to Causal Analysis on Observational Data using Propensity Scores An Introduction to Causal Analysis on Observational Data using Propensity Scores Margie Rosenberg*, PhD, FSA Brian Hartman**, PhD, ASA Shannon Lane* *University of Wisconsin Madison **University of Connecticut

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Econ 673: Microeconometrics Chapter 12: Estimating Treatment Effects. The Problem

Econ 673: Microeconometrics Chapter 12: Estimating Treatment Effects. The Problem Econ 673: Microeconometrics Chapter 12: Estimating Treatment Effects The Problem Analysts are frequently interested in measuring the impact of a treatment on individual behavior; e.g., the impact of job

More information

studies, situations (like an experiment) in which a group of units is exposed to a

studies, situations (like an experiment) in which a group of units is exposed to a 1. Introduction An important problem of causal inference is how to estimate treatment effects in observational studies, situations (like an experiment) in which a group of units is exposed to a well-defined

More information

Background of Matching

Background of Matching Background of Matching Matching is a method that has gained increased popularity for the assessment of causal inference. This method has often been used in the field of medicine, economics, political science,

More information

Section 10: Inverse propensity score weighting (IPSW)

Section 10: Inverse propensity score weighting (IPSW) Section 10: Inverse propensity score weighting (IPSW) Fall 2014 1/23 Inverse Propensity Score Weighting (IPSW) Until now we discussed matching on the P-score, a different approach is to re-weight the observations

More information

NBER WORKING PAPER SERIES A NOTE ON ADAPTING PROPENSITY SCORE MATCHING AND SELECTION MODELS TO CHOICE BASED SAMPLES. James J. Heckman Petra E.

NBER WORKING PAPER SERIES A NOTE ON ADAPTING PROPENSITY SCORE MATCHING AND SELECTION MODELS TO CHOICE BASED SAMPLES. James J. Heckman Petra E. NBER WORKING PAPER SERIES A NOTE ON ADAPTING PROPENSITY SCORE MATCHING AND SELECTION MODELS TO CHOICE BASED SAMPLES James J. Heckman Petra E. Todd Working Paper 15179 http://www.nber.org/papers/w15179

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Causal Sensitivity Analysis for Decision Trees

Causal Sensitivity Analysis for Decision Trees Causal Sensitivity Analysis for Decision Trees by Chengbo Li A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics in Computer

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Achieving Optimal Covariate Balance Under General Treatment Regimes

Achieving Optimal Covariate Balance Under General Treatment Regimes Achieving Optimal Covariate Balance Under General Treatment Regimes Marc Ratkovic September 21, 2011 Abstract Balancing covariates across treatment levels provides an effective and increasingly popular

More information

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest

More information

Gaussian Processes (10/16/13)

Gaussian Processes (10/16/13) STA561: Probabilistic machine learning Gaussian Processes (10/16/13) Lecturer: Barbara Engelhardt Scribes: Changwei Hu, Di Jin, Mengdi Wang 1 Introduction In supervised learning, we observe some inputs

More information

Matching. Quiz 2. Matching. Quiz 2. Exact Matching. Estimand 2/25/14

Matching. Quiz 2. Matching. Quiz 2. Exact Matching. Estimand 2/25/14 STA 320 Design and Analysis of Causal Studies Dr. Kari Lock Morgan and Dr. Fan Li Department of Statistical Science Duke University Frequency 0 2 4 6 8 Quiz 2 Histogram of Quiz2 10 12 14 16 18 20 Quiz2

More information

Matching Techniques. Technical Session VI. Manila, December Jed Friedman. Spanish Impact Evaluation. Fund. Region

Matching Techniques. Technical Session VI. Manila, December Jed Friedman. Spanish Impact Evaluation. Fund. Region Impact Evaluation Technical Session VI Matching Techniques Jed Friedman Manila, December 2008 Human Development Network East Asia and the Pacific Region Spanish Impact Evaluation Fund The case of random

More information

Comment on Article by Scutari

Comment on Article by Scutari Bayesian Analysis (2013) 8, Number 3, pp. 543 548 Comment on Article by Scutari Hao Wang Scutari s paper studies properties of the distribution of graphs ppgq. This is an interesting angle because it differs

More information

Since the seminal paper by Rosenbaum and Rubin (1983b) on propensity. Propensity Score Analysis. Concepts and Issues. Chapter 1. Wei Pan Haiyan Bai

Since the seminal paper by Rosenbaum and Rubin (1983b) on propensity. Propensity Score Analysis. Concepts and Issues. Chapter 1. Wei Pan Haiyan Bai Chapter 1 Propensity Score Analysis Concepts and Issues Wei Pan Haiyan Bai Since the seminal paper by Rosenbaum and Rubin (1983b) on propensity score analysis, research using propensity score analysis

More information

ESTIMATING AVERAGE TREATMENT EFFECTS: REGRESSION DISCONTINUITY DESIGNS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics

ESTIMATING AVERAGE TREATMENT EFFECTS: REGRESSION DISCONTINUITY DESIGNS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics ESTIMATING AVERAGE TREATMENT EFFECTS: REGRESSION DISCONTINUITY DESIGNS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009 1. Introduction 2. The Sharp RD Design 3.

More information

Classification and Pattern Recognition

Classification and Pattern Recognition Classification and Pattern Recognition Léon Bottou NEC Labs America COS 424 2/23/2010 The machine learning mix and match Goals Representation Capacity Control Operational Considerations Computational Considerations

More information

Gov 2002: 4. Observational Studies and Confounding

Gov 2002: 4. Observational Studies and Confounding Gov 2002: 4. Observational Studies and Confounding Matthew Blackwell September 10, 2015 Where are we? Where are we going? Last two weeks: randomized experiments. From here on: observational studies. What

More information

Bayesian Additive Regression Tree (BART) with application to controlled trail data analysis

Bayesian Additive Regression Tree (BART) with application to controlled trail data analysis Bayesian Additive Regression Tree (BART) with application to controlled trail data analysis Weilan Yang wyang@stat.wisc.edu May. 2015 1 / 20 Background CATE i = E(Y i (Z 1 ) Y i (Z 0 ) X i ) 2 / 20 Background

More information

PROPENSITY SCORE MATCHING. Walter Leite

PROPENSITY SCORE MATCHING. Walter Leite PROPENSITY SCORE MATCHING Walter Leite 1 EXAMPLE Question: Does having a job that provides or subsidizes child care increate the length that working mothers breastfeed their children? Treatment: Working

More information

Causal Inference with General Treatment Regimes: Generalizing the Propensity Score

Causal Inference with General Treatment Regimes: Generalizing the Propensity Score Causal Inference with General Treatment Regimes: Generalizing the Propensity Score David van Dyk Department of Statistics, University of California, Irvine vandyk@stat.harvard.edu Joint work with Kosuke

More information

NISS. Technical Report Number 167 June 2007

NISS. Technical Report Number 167 June 2007 NISS Estimation of Propensity Scores Using Generalized Additive Models Mi-Ja Woo, Jerome Reiter and Alan F. Karr Technical Report Number 167 June 2007 National Institute of Statistical Sciences 19 T. W.

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Targeted Maximum Likelihood Estimation in Safety Analysis

Targeted Maximum Likelihood Estimation in Safety Analysis Targeted Maximum Likelihood Estimation in Safety Analysis Sam Lendle 1 Bruce Fireman 2 Mark van der Laan 1 1 UC Berkeley 2 Kaiser Permanente ISPE Advanced Topics Session, Barcelona, August 2012 1 / 35

More information

Assess Assumptions and Sensitivity Analysis. Fan Li March 26, 2014

Assess Assumptions and Sensitivity Analysis. Fan Li March 26, 2014 Assess Assumptions and Sensitivity Analysis Fan Li March 26, 2014 Two Key Assumptions 1. Overlap: 0

More information

Quantitative Economics for the Evaluation of the European Policy

Quantitative Economics for the Evaluation of the European Policy Quantitative Economics for the Evaluation of the European Policy Dipartimento di Economia e Management Irene Brunetti Davide Fiaschi Angela Parenti 1 25th of September, 2017 1 ireneb@ec.unipi.it, davide.fiaschi@unipi.it,

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW1 due next lecture Project details are available decide on the group and topic by Thursday Last time Generative vs. Discriminative

More information

Implementing Matching Estimators for. Average Treatment Effects in STATA

Implementing Matching Estimators for. Average Treatment Effects in STATA Implementing Matching Estimators for Average Treatment Effects in STATA Guido W. Imbens - Harvard University West Coast Stata Users Group meeting, Los Angeles October 26th, 2007 General Motivation Estimation

More information

Minimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data. Jeff Dominitz RAND. and

Minimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data. Jeff Dominitz RAND. and Minimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data Jeff Dominitz RAND and Charles F. Manski Department of Economics and Institute for Policy Research, Northwestern

More information

Estimating and Using Propensity Score in Presence of Missing Background Data. An Application to Assess the Impact of Childbearing on Wellbeing

Estimating and Using Propensity Score in Presence of Missing Background Data. An Application to Assess the Impact of Childbearing on Wellbeing Estimating and Using Propensity Score in Presence of Missing Background Data. An Application to Assess the Impact of Childbearing on Wellbeing Alessandra Mattei Dipartimento di Statistica G. Parenti Università

More information

Selective Inference for Effect Modification

Selective Inference for Effect Modification Inference for Modification (Joint work with Dylan Small and Ashkan Ertefaie) Department of Statistics, University of Pennsylvania May 24, ACIC 2017 Manuscript and slides are available at http://www-stat.wharton.upenn.edu/~qyzhao/.

More information

Implementing Matching Estimators for. Average Treatment Effects in STATA. Guido W. Imbens - Harvard University Stata User Group Meeting, Boston

Implementing Matching Estimators for. Average Treatment Effects in STATA. Guido W. Imbens - Harvard University Stata User Group Meeting, Boston Implementing Matching Estimators for Average Treatment Effects in STATA Guido W. Imbens - Harvard University Stata User Group Meeting, Boston July 26th, 2006 General Motivation Estimation of average effect

More information

Machine Learning for Signal Processing Bayes Classification and Regression

Machine Learning for Signal Processing Bayes Classification and Regression Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For

More information

Causal Inference with Big Data Sets

Causal Inference with Big Data Sets Causal Inference with Big Data Sets Marcelo Coca Perraillon University of Colorado AMC November 2016 1 / 1 Outlone Outline Big data Causal inference in economics and statistics Regression discontinuity

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Genetic Matching for Estimating Causal Effects:

Genetic Matching for Estimating Causal Effects: Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies Alexis Diamond Jasjeet S. Sekhon Forthcoming, Review of Economics and

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Propensity Score Methods for Causal Inference

Propensity Score Methods for Causal Inference John Pura BIOS790 October 2, 2015 Causal inference Philosophical problem, statistical solution Important in various disciplines (e.g. Koch s postulates, Bradford Hill criteria, Granger causality) Good

More information

Causal Inference by Minimizing the Dual Norm of Bias. Nathan Kallus. Cornell University and Cornell Tech

Causal Inference by Minimizing the Dual Norm of Bias. Nathan Kallus. Cornell University and Cornell Tech Causal Inference by Minimizing the Dual Norm of Bias Nathan Kallus Cornell University and Cornell Tech www.nathankallus.com Matching Zoo It s a zoo of matching estimators for causal effects: PSM, NN, CM,

More information

Moving the Goalposts: Addressing Limited Overlap in Estimation of Average Treatment Effects by Changing the Estimand

Moving the Goalposts: Addressing Limited Overlap in Estimation of Average Treatment Effects by Changing the Estimand Moving the Goalposts: Addressing Limited Overlap in Estimation of Average Treatment Effects by Changing the Estimand Richard K. Crump V. Joseph Hotz Guido W. Imbens Oscar Mitnik First Draft: July 2004

More information

causal inference at hulu

causal inference at hulu causal inference at hulu Allen Tran July 17, 2016 Hulu Introduction Most interesting business questions at Hulu are causal Business: what would happen if we did x instead of y? dropped prices for risky

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Potential Outcomes Model (POM)

Potential Outcomes Model (POM) Potential Outcomes Model (POM) Relationship Between Counterfactual States Causality Empirical Strategies in Labor Economics, Angrist Krueger (1999): The most challenging empirical questions in economics

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

The propensity score with continuous treatments

The propensity score with continuous treatments 7 The propensity score with continuous treatments Keisuke Hirano and Guido W. Imbens 1 7.1 Introduction Much of the work on propensity score analysis has focused on the case in which the treatment is binary.

More information

When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data?

When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data? When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data? Kosuke Imai Department of Politics Center for Statistics and Machine Learning Princeton University

More information

SIMULATION-BASED SENSITIVITY ANALYSIS FOR MATCHING ESTIMATORS

SIMULATION-BASED SENSITIVITY ANALYSIS FOR MATCHING ESTIMATORS SIMULATION-BASED SENSITIVITY ANALYSIS FOR MATCHING ESTIMATORS TOMMASO NANNICINI universidad carlos iii de madrid UK Stata Users Group Meeting London, September 10, 2007 CONTENT Presentation of a Stata

More information

Measuring Social Influence Without Bias

Measuring Social Influence Without Bias Measuring Social Influence Without Bias Annie Franco Bobbie NJ Macdonald December 9, 2015 The Problem CS224W: Final Paper How well can statistical models disentangle the effects of social influence from

More information

Learning from Examples

Learning from Examples Learning from Examples Data fitting Decision trees Cross validation Computational learning theory Linear classifiers Neural networks Nonparametric methods: nearest neighbor Support vector machines Ensemble

More information

Imbens/Wooldridge, Lecture Notes 1, Summer 07 1

Imbens/Wooldridge, Lecture Notes 1, Summer 07 1 Imbens/Wooldridge, Lecture Notes 1, Summer 07 1 What s New in Econometrics NBER, Summer 2007 Lecture 1, Monday, July 30th, 9.00-10.30am Estimation of Average Treatment Effects Under Unconfoundedness 1.

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Matching. Stephen Pettigrew. April 15, Stephen Pettigrew Matching April 15, / 67

Matching. Stephen Pettigrew. April 15, Stephen Pettigrew Matching April 15, / 67 Matching Stephen Pettigrew April 15, 2015 Stephen Pettigrew Matching April 15, 2015 1 / 67 Outline 1 Logistics 2 Basics of matching 3 Balance Metrics 4 Matching in R 5 The sample size-imbalance frontier

More information

Estimation of the Conditional Variance in Paired Experiments

Estimation of the Conditional Variance in Paired Experiments Estimation of the Conditional Variance in Paired Experiments Alberto Abadie & Guido W. Imbens Harvard University and BER June 008 Abstract In paired randomized experiments units are grouped in pairs, often

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Machine Learning (CS 567) Lecture 5

Machine Learning (CS 567) Lecture 5 Machine Learning (CS 567) Lecture 5 Time: T-Th 5:00pm - 6:20pm Location: GFS 118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Mining Classification Knowledge

Mining Classification Knowledge Mining Classification Knowledge Remarks on NonSymbolic Methods JERZY STEFANOWSKI Institute of Computing Sciences, Poznań University of Technology COST Doctoral School, Troina 2008 Outline 1. Bayesian classification

More information

CompSci Understanding Data: Theory and Applications

CompSci Understanding Data: Theory and Applications CompSci 590.6 Understanding Data: Theory and Applications Lecture 17 Causality in Statistics Instructor: Sudeepa Roy Email: sudeepa@cs.duke.edu Fall 2015 1 Today s Reading Rubin Journal of the American

More information

Machine Learning for Signal Processing Bayes Classification

Machine Learning for Signal Processing Bayes Classification Machine Learning for Signal Processing Bayes Classification Class 16. 24 Oct 2017 Instructor: Bhiksha Raj - Abelino Jimenez 11755/18797 1 Recap: KNN A very effective and simple way of performing classification

More information

Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications

Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications Pengtao Xie, Machine Learning Department, Carnegie Mellon University 1. Background Latent Variable Models (LVMs) are

More information

arxiv: v1 [stat.me] 15 May 2011

arxiv: v1 [stat.me] 15 May 2011 Working Paper Propensity Score Analysis with Matching Weights Liang Li, Ph.D. arxiv:1105.2917v1 [stat.me] 15 May 2011 Associate Staff of Biostatistics Department of Quantitative Health Sciences, Cleveland

More information

Applied Microeconometrics (L5): Panel Data-Basics

Applied Microeconometrics (L5): Panel Data-Basics Applied Microeconometrics (L5): Panel Data-Basics Nicholas Giannakopoulos University of Patras Department of Economics ngias@upatras.gr November 10, 2015 Nicholas Giannakopoulos (UPatras) MSc Applied Economics

More information

Combining multiple observational data sources to estimate causal eects

Combining multiple observational data sources to estimate causal eects Department of Statistics, North Carolina State University Combining multiple observational data sources to estimate causal eects Shu Yang* syang24@ncsuedu Joint work with Peng Ding UC Berkeley May 23,

More information

Bayesian Inference for Sequential Treatments under Latent Sequential Ignorability. June 19, 2017

Bayesian Inference for Sequential Treatments under Latent Sequential Ignorability. June 19, 2017 Bayesian Inference for Sequential Treatments under Latent Sequential Ignorability Alessandra Mattei, Federico Ricciardi and Fabrizia Mealli Department of Statistics, Computer Science, Applications, University

More information

A Note on Adapting Propensity Score Matching and Selection Models to Choice Based Samples

A Note on Adapting Propensity Score Matching and Selection Models to Choice Based Samples DISCUSSION PAPER SERIES IZA DP No. 4304 A Note on Adapting Propensity Score Matching and Selection Models to Choice Based Samples James J. Heckman Petra E. Todd July 2009 Forschungsinstitut zur Zukunft

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Predicting the Treatment Status

Predicting the Treatment Status Predicting the Treatment Status Nikolay Doudchenko 1 Introduction Many studies in social sciences deal with treatment effect models. 1 Usually there is a treatment variable which determines whether a particular

More information