arxiv: v1 [stat.me] 18 Nov 2018
|
|
- Todd Bond
- 5 years ago
- Views:
Transcription
1 MALTS: Matching After Learning to Stretch Harsh Parikh 1, Cynthia Rudin 1, and Alexander Volfovsky 2 1 Computer Science, Duke University 2 Statistical Science, Duke University arxiv: v1 [stat.me] 18 Nov 2018 November 20, 2018 Abstract We introduce a flexible framework for matching in causal inference that produces high quality almostexact matches. Most prior work in matching uses ad hoc distance metrics, often leading to poor quality matches, particularly when there are irrelevant covariates that degrade the distance metric. In this work, we learn an interpretable distance metric used for matching, which leads to substantially higher quality matches. The distance metric can stretch continuous covariates and matches exactly on categorical covariates. The framework is flexible in that the user can choose the form of distance metric, the type of optimization algorithm, and the type of relaxation for matching. Our ability to learn flexible distance metrics leads to matches that are interpretable and useful for estimation of conditional average treatment effects. 1 Introduction Matching methods are pervasively used throughout the social and health sciences to make causal conclusions where access to randomized control trial is scarce, but when observational data are widely available. Matching methods find groups of similar individuals, some of whom select into treatment and some of whom select into control. As such, matching methods are interpretable in that they allow fine-grained troubleshooting of the data; for instance, examining the matched groups may allow the user to locate unmeasured confounders that lead units to have a higher chance of being treated. Matching is also powerful in that it can allow the user to correctly estimate highly nonlinear treatment effects but this is only true when the matches are of high quality. The quality of the matches is our main consideration in this work. Typically, matching methods place units into the same matched group when they are close together, where closeness is measured in terms of a pre-defined distance (e.g., exact, coarsened exact, Euclidean, etc.), while maintaining balance constraints between treatment and control units. Despite its merits, this classical paradigm has serious flaws, namely that it relies heavily on a properly specified distance metric. The distance metric cannot be determined without an understanding of the importance of the variables; for instance, the quality of any prespecified distance metric that weighs all covariates equally will degrade as the number of irrelevant covariates increases. This is true no matter which matching methodology is used as the distance metric calculations could be essentially determined by these irrelevant covariates. This has previously been referred to as the toenail problem (Wang et al., 2017), where the weighing of irrelevant covariates (like toenail length) can overwhelm the metric for matching. A separate concern is that the covariates may be scaled differently, where a given distance along one covariate has a different impact than the same distance along a different covariate; in this case, the total distance metric can easily be determined by less relevant covariates, again leading to lower quality matched groups. Ideally, the distance metric would stretch to capture important covariates so that after matching, estimates of treatment effects within the matched groups would be accurate. If the researcher knows how to 1
2 choose the distance metric so that it leads to accurate treatment effect estimates, this would solve the problem. However there is no reason to believe that this is achievable in high-dimensional settings. Producing high dimensional functions to characterize data is a task for which humans alone are not naturally adept. In this work, we propose a framework for matching where an interpretable distance measure between matched units is learned from a held-out training set. As long as the distance metric generalizes from the training set to the full sample, we are able to compute high-quality matches and accurate estimates of treatment effects (conditional average treatment effects CATEs) within the matched groups. One can use any form of distance metric to train, and in this work we focus on exact matching for discrete variables and generalized Mahalanobis distances for continuous variables. By definition, the generalized Mahalanobis distance is determined by a matrix. If the matrix is diagonal, the distance calculation is a stretch for each covariate. Irrelevant covariates will be compressed so that their values are effectively always zero. Highly relevant covariates will be stretched, so that to be considered a match, the values of that covariate must be very similar. In this way, diagonal matrices lead to very interpretable distance metrics. If the Mahalanobis distance matrix is not constrained to be diagonal, then it induces a stretch and rotation, leading to more flexible but less interpretable notions of distance. The new framework is called Learning-to-Match, and the algorithm introduced in this work is called Matching After Learning to Stretch (MALTS). We tested MALTS against several other matching methods in simulation studies, where ground truth is known, and here, MALTS achieves substantially and consistently better results than other methods including Genmatch, propensity score matching, standard Mahalanobis distance and coarsened exact matching. Even though our method is heavily constrained to produce interpretable matches, it performs at the same level as non-matching methods that are designed to fit extremely flexible but uninterpretable models directly to the response surface. 2 Related work Since the 1970 s, the causal inference literature on matching methods has concentrated on dimension reduction techniques (e.g., Rubin, 1973a,b, 1976; Cochran and Rubin, 1973). The leading dimension reduction approach in this literature uses the propensity score, the conditional probability of treatment given covariate information. Propensity score methods target average treatment effects and so do not produce exact matches or almost-exact matches. When treatment is binary, they project data onto one dimension, and closeness of units in propensity score does not imply their closeness in covariate space. As a result, the matches cannot directly be used for estimating heterogeneous treatment effects. Regression methods can be used for CATE estimation, but this assumes that the regression method is correctly specified or in the case of doubly robust estimation (e.g., Farrell, 2015) either the propensity model or the outcome model needs to be correctly specified. In practice neither is likely to be correctly specified. Machine learning approaches generalize regression approaches and can create models that are extremely flexible and predict outcomes accurately for both treatment and control groups (Hahn et al., 2017; Hill, 2011; Chernozhukov et al., 2016). However this loses the interpretability inherent to almost-exact matches. In practice, MALTS performs similarly to (or better than) several machine learning methods in our experiments, despite being restricted to interpretable almost-exact matches. A flexible setup for producing high quality matches is the optimal matching literature (Rosenbaum, 2016), which started with network flow algorithms and has evolved to use integer programming to produce matches that are constrained in user-defined ways (Zubizarreta, 2012; Zubizarreta et al., 2014; Keele and Zubizarreta, 2014; Resa and Zubizarreta, 2016; Noor-E-Alam and Rudin, 2015b,a; Kallus, 2017). In all of these approaches, the user defines the distance metric rather than learning it through data, which is time consuming and likely inaccurate, leading potentially to poor quality of the matched groups. Further, integer programming methods do not scale to problems of the sizes we consider in this manuscript. An alternative to optimal matching is coarsened exact matching (Iacus et al., 2012), an approach that requires users to specify explicit bins for all covariates on which to construct almost-exact matches. This requires users to know in advance which high-dimensional bins the outcome will be insensitive to, which is essentially equivalent to requiring the user to know the answer to the problem we investigate in this work. 2
3 Large amounts of user choice to define these bins can also lead to unintentional user bias. By training the stretching rather than asking the user to define it as in CEM, this bias is reduced. 3 Learning-to-Match Framework Throughout we will write yi T (x i) for the treated potential outcome of an individual with observed pretreatment covariates x i and yi C(x i) for the control potential outcome. When notationally obvious we simplify these to y T, y C, and x. We assume that there are no unobserved confounders and that the stanadrd ignorability assumptions hold (Rubin, 2005). For each individual we define their conditional treatment effect as yi T (x i) yi C(x i). In our framework, a learning-to-match algorithm consist of three modeling decisions: the form of distance metric used for matching, the method of learning parameters of that distance metric, and the method of matching. A training set is used to train the parameters of the distance metric, and that learned distance metric is used on the rest of the sample in the test-set for matching and CATE prediction. Formally, a distance function is a map distance(x i, x k ) R that is symmetric, positive definite, and obeys the triangle inequality, where covariates x i are in R p. Our goal is to minimize the expected loss between estimated treatment effects TE(x) and true treatment effects TE(x) = E(y T (x) y C (x)) across a target population µ(x, y T, y C ) (this can either be a finite or super-population). We use hat notation ˆ for estimates. Let the expected loss be: E µ l( TE(x), TE(x)) = l( TE(x), TE(x))dµ(x, y T, y C ) µ = l(ŷ T (x) ŷ C (x), y T (x) y C (x))dµ(x, y T, y C ). µ We use squared loss, l(a, b) = (a b) 2. If we had a random i.i.d. sample {x i } i from the marginal distribution of the covariates µ(x), and if we had labels y T (x i ) and y C (x i ) we could then approximate the expected loss by the average loss over the sample: E µ l( TE(x), TE(x) 1 n l(ŷ T (x i ) ŷ C (x i ), y T (x i ) y C (x i )), n i=1 however, the difficulty in causal inference is that we only observe treatment outcomes y T (x i ) or control outcomes y C (x i ) for an individual i, so we cannot directly calculate the treatment effect. Breaking the sum into treatment and control group: E µ l( TE(x), TE(x)) 1 n t i treated l(ŷt (x i ) ŷ C (x i ), y T (x i ) y C (x i )) + 1 n c i control l(ŷt (x i ) ŷ C (x i ), y T (x i ) y C (x i )). If the target population is the treatment group (for estimating conditional average treatment effects on the treated CATT), then the sum over the control group would be removed from the quantity above. Consider one term from the treatment group. We know y T (x i ) and thus do not estimate it, so the term becomes l(y T (x i ) ŷ C (x i ), y T (x i ) y C (x i )). We still do not know y C (x i ), and must estimate it. If we use matching, we will compute the estimated control outcome ŷ C (x i ) by an average of the control outcomes within its matched group that we can observe. Let us define the matched group for unit i in terms of the observed control units indexed by k, MG(i, distance, {x C k } k, {y C k } k ) = {k : distance(x C k, x C i ) < δ}. (1) We allow overlap of matched groups in this notation, thus these are bounded-distance nearest-neighbor matches, using the specified distance metric. Thus, ŷ C (x i ) = 1 K k MG(i,distance,{x C k } k,{yk C} k) yk C, (2) 3
4 where K is the size of the matched group MG(i, distance, {x C k } k, {yk C} k). In order for matching to yield a high quality estimate ŷi C, it is sufficient (but not necessary) for the distance function to have the following property: Smoothness-of-Outcomes-for-Distance Property (SOD). A distance measure has the SOD property if for any observed values of a and b: distance(x a, x b ) < δ y T a y T b < ɛ. (3) In other words, according to the SOD condition, if we consider all units within a small distance of x a, the value of the outcome is almost constant. If this were true for all units, then we automatically have 1 K (yi C ɛ) ŷ C (x i ) 1 K k MG (yi C + ɛ) k MG y C i ɛ ŷ C (x i ) y C i + ɛ. where both summations are over k MG ( i, distance, {x C k } k, {yk C} k). Thus, the SOD condition would suffice to provide us with high quality estimates for counterfactuals, and therefore, low losses on estimates of conditional average treatment effects. Our framework learns a distance metric from a separate training set of data (not the test samples considered in the averages above), so that the distance approximately has the SOD property. Denote this training set by {x T,tr i, y T,tr i } i, {x C,tr i property, we minimize the following: distance arg, y C,tr i min distance } i. To learn distance so that its distance has an approximate SOD ( i treated + i control y tr,t i ) 2 ŷ T (x tr,t i ) ( y tr,c i ŷ C (x tr,c i ) where ŷ C (x tr,c i ) is defined by Equations (1) and (2) including its dependence on the distance, using the training data for creating matched groups, and ŷ T (x tr,t i ) is defined analogously. More generally we can write the step above as: distance arg min distance LearnDistance({xT,tr i, y T,tr i 4 Matching After Learning to Stretch ) 2, } i, {x C,tr i, y C,tr i } i, distance). (4) MALTS is an almost exact matching method for causal inference and treatment estimation designed to work on both experimental and observational datasets. MALTS performs treatment effect estimation using following three stages: 1) distance metric learning, 2) matching samples, and 3) estimating CATEs. Let us start with nearest neighbor matching using a learned weight function that depends on the distances between points. The weights for the weighted nearest neighbors matching can be learned via this specification of Eq (4): 2 W = arg min W + i control i treated y tr,c i y tr,t i j control,i j j treated,i j W i,jy tr,c j W i,jy tr,t j 2. We let W i,j be a function of the distances {distance(x i, x j )} ij. For example, the W i,j can be binary to encode whether j belongs to i s K-nearest neighbors. Alternatively they can encode soft KNN weights where W i,j = e distance(xi,xj). In the remainder of the paper we consider distance functions for W that are parameterized by a set of parameters L. 4
5 Euclidean distance in X space works best for continuous covariates. As such, one might consider distances of the form Lx a Lx b 2 where L codes the orientation of the data. An example of this in the causal inference literature is the classical Mahalanobis distance where L is hardcoded as the inverse covariance matrix for the observed Xs. This approach has been demonstrated to perform well in settings where all covariates are observed and the inferential target is the average treatment effect Stuart (2010). We are also interested in individualized treatment effects and just as the choice of Euclidean norm in Mahalanobis distance matching depends on the estimand of interest, the stretch metric needs to be amended for this new estimand. We propose learning L directly from the observed data rather than setting it up front; this is because we do not know the true distance metric, and humans are not good at creating high dimensional functions in our heads from data. The distance metric can be learned such that the objective with respect to W is minimized. For discrete (categorical/unordered) data, Euclidean distance is not a natural distance metric mixed (ordered+unordered) data poses a different set of challenges than either one alone, given the geometry of the space. Approximate closeness is ill-defined for unordered discrete covariates while exact matching is not usually possible on continuous covariates as it is unlikely for two points to have exactly the same values (assuming a continuous distribution for continuous covariates, a match would occur with almost zero probability). If we use the same distance metric for both discrete and continuous data, then units that are close in ordinal space might be arbitrarily far in categorical space. Because of this, it is not possible to parameterize a single form of distance metric to enforce both exact matching on categorical data and almost exact matching for ordinal data. While Mahalanobis distance matching papers recommend converting unordered categorical variables to binary indicators, this approach does not scale and in fact can introduce an overwhelming number of irrelevant covariates. We propose parametrizing our distance metric in terms of two components: one is a learned weighted Euclidean distance for continuous (or generally ordered) covariates while the other is learned weighted hamming distance for discrete (or generally unordered) covariates. These components are separately parameterized by matrices L c and L d respectively. Let a c, b c denote continuous covariates while a d, b d denote discrete ones. The distance metric we propose to work with is given by: distance L (a, b) = d Lc (a c, b c ) + d Ld (a d, b d ) d Lc (a c, b c ) = L c a c L c b c 2, d Ld (a d, b d ) = L (i,i) d 1[a (i) d b(i) d ], where 1[A] is the indicator that event A occurred. We thus perform almost-exact matching on the unordered covariates and learned-mahalanobis-distance matching on the rest of the covariates. We thus learn L such that we have approximately solved: 2 2 L arg min L i treatment group y tr,t i 1 K k KNN L (T,x i ) y tr,t k + a d i=0 i control group y tr,c i 1 K k KNN L (C,x i ) y tr,c k, where KNN L (B, x i ) is defined as the set of K nearest points to x i using the metric distance(x i, x k ) parameterized by L. For interpretability, we let L c be a diagonal matrix, which allows stretches or reflections of the continuous covariates. This way, the magnitude of an entry in L c or L d provides the relative importance of the indicated covariate for the causal inference problem. For simplicity in the formulation above, we assumed additive separability of ordered and unordered parts. If desired, separate stretch metrics L c can be learned for each discrete bin; a natural extension (not explored here) is to use Bayesian shrinkage to pull the stretch matrices together for the different bins. We use python scipy library s implementation of COBYLA, a non-gradient optimization method to learn the optimal L Jones et al. (01 ); Powell (1994). After learning the distance matrix on the training data, we used the new learned distance metric to predict conditional average treatment effects (CATEs) for each unit in the test-set, using nearest neighbors from the training set. For any given new point x in the p-dimensional covariate space we construct the 5
6 matched group using control set C via KNN L (C, x), and using treatment set T via KNN L (T, x). Estimated CATE for a point x is calculated via KNN L for treatment and control: ĈATE(x) = 1 ( ) y tr,t i y tr,c i. K i KNN L (T,x) i KNN L (C,x) If we were to abandon the goal of an interpretable distance function, the framework trivially generalizes to incorporate complex data structures by introducing a neural network (or other flexible learning) framework for coding the data. That is, we can redefine the distance metric via distance L (x i, x j ) = φ L (x i ), φ L (x j ) or distance L (x i, x j ) = (φ L (x i ) φ L (x j )) 2, where φ is a summary of relevant data features that is learned using a neural network or other complex modeling framework. Because deep neural networks tend to show improvements over other methods only for certain problems where latent spaces need to be constructed (computer vision, speech), we expect that the stretch/almost-exact match combination should suffice for most datasets. 5 Experiments In this section, we discuss the the performance of MALTS on both synthetically generated datasets (ordinal covariates and mixed covariates) and real world datasets. 5.1 Ordinal Covariates A variety of experimental settings were designed where the covariates were sampled from a continuous distribution. We analyzed MALTS by studying error rates in prediction of CATEs, analyzing the matched groups produced by MALTS, and pattern of change in error rate as number of covariates and number of units in dataset changes. The data generation process used for testing MALTS on continuous covariates includes quadratic treatment effect terms in addition to a linear treatment effect and linear baseline effect, as follows: Data generative process-1 (with independent covariates): x N (µ, Σ), µ = [3, 3.., 3] p 1, Σ = 0.5 I(p p) p(a i = 1) = p(a i = 1) = 1 2, β i = a i 10 2 i, α i N (1, 0.5) k k k y = β i x i + T α i x i + T x i x j. i=1 i=1 Data generative process-2 (with correlated covariates): i,j=1,j>i x N (µ, Σ), µ = [3, 3.., 3] p 1, Σ = 0.5 Σ = (1 ρ)i + ρ11 T, ρ = 0.2 p(a i = 1) = p(a i = 1) = 1 2, β i = a i 10 2 i, α i N (1, 0.5) k k k y = β i x i + T α i x i + T x i x j. i=1 i=1 i,j=1,j>i In these two DGPs we generate p covariates, with k of them contributing to the outcome (that is, there are p k irrelevant covariates). 6
7 Figure 1: Variance within matched groups tends to be smaller for important covariates. Average variance within matched groups for each covariate during the testing phase of MALTS. Variables 1, 2, and 3 are important and thus are matched more closely by MALTS, leading to lower variances within groups Variance within the Matched groups We generated 10 different training sets where p = k = 10 and n c = n t = We test the method on 10 different test datasets with n test = Recall that during the training phase, MALTS learns the distance metric in a way that stretches the more relevant covariates while compressing the irrelevant covariates in order to better predict the outcome. Because the β s of the true model are exponentially decreasing in absolute value, MALTS should ideally learn a distance metric with the largest stretch for the first covariate, second largest stretch for the second one, and so on. This ensures that the units within a matched group will be closest on the covariates that affect the outcome the most. A natural measure of covariate balance in the test set is the variability of a covariate within a matched group. After MALTS we expect that the average variance of the first covariate dimension within matched groups is the lowest, followed by the second, followed by the third, until prediction uncertainty overwhelms the (diminishing) importance of covariates. For this collection of datasets, Figure 1 plots the average variances for each covariate. We note that the variance stops increasing beyond the fourth covariate that is the matching mechanism does not necessarily distinguish between the importance of covariates above index four. As the contribution of these covariates to the outcome is less than 10% of the magnitude of the first covariate this suggests that we are likely learning an appropriate distance metric Number of covariates and size of training set In this simulation we consider the performance of our distance-metric-learning task as a function of the size of the training set and number of covariates. Using the same simulation setup as above we increase p = k, n c, and n t to study this behavior. In this simulation the ATE is given by 4.5p 2 1.5p and Figure 2 plots the average CATE absolute error for different training set sizes. For a fixed number of covariates we observe that increasing the size of the training set can reduce the absolute error in estimating the CATEs by nearly 50%. Increasing the number of covariates makes the inferrential task substantially more complex thus leading to an increase in absolute error. However, relative to the increasing size of the actual CATEs (which are on the order of the ATE), we actually see a reduction in relative error. 7
8 Figure 2: Amount of data to achieve low error depends on the number of covariates. Variation of error rate vs increasing number of covariates for datasets of different sizes Error-rate analysis Figure 3: MALTS performs well with respect to other methods. Violin plots of CATE Absolute Error on test set for a variety of methods. MALTS performs well, despite being a matching method. (a) The test data have independent covariates. (b) The test data have covariance matrix Σ = (1 ρ)i + ρ11 T where ρ = 0.2. (c) Frequency plot of CATE absolute error for results shown in the middle figure. More detailed results on the distributions are in Figure 5 and 6. In this simulation we compare MALTS with several other methods: BARTChipman et al. (2010), CRF Athey et al. (2015), difference of random forests, GenMatchDiamond and Sekhon (2013) and Propensity Score Nearest Neighbor matchingross et al. (2015). We study two different settings: one with uncorrelated and one with correlated covariates. Uncorrelated Xs : The training set contains n c = 1000 control units and n t = 1000 treated units. There are p = 10 total covariates observed, but only k = 8 associated with the outcome. Correlated Xs : The training set contains n c = n t = 500 with p = 18 total covariates observed but only k = 8 associated with the outcome. In both settings, the test set has n test = 5000 units. 8
9 Figure 4: Tighter matched groups are higher quality. Scatter plot of CATE estimates for different matched groups versus the reciprocal of the diameter of the matched group. Dashed curve is a fitted exponential. One may choose to prune matched groups based on diameter to remove less reliable large diameter groups. Figure 3 compares the errors in estimating CATEs in both settings using the different methods. We note that the other matching methods are not designed for CATE estimation and so perform very poorly. On the other hand we see that MALTS performs on the order of the best modeling method, BART. Figure 4(a) is based on the the uncorrelated simulation and plots the reciprocal of the diameter of each matched group (where diameter is defined as the maximum distance of matched samples to the query sample in a matched group) versus the absolute CATE error. We note that tighter groups are of higher quality and lead to better estimation of CATEs. We can threshold and remove bad quality matched groups as function of diameter or we can weight the matched group as a function of diameter in estimation of estimands. Figures 5 and 6 provide a detailed view of the performance of the different methods. Of the interpretable matching methods it is clear that MALTS performs the best. It also outperforms less interpretable black-box methods like CRF and different of random forests. We also note that when covariates are correlated, the performance of MALTS improves even though the number of training samples is halved while the number of covariates is almost doubled. 5.2 Mixed Covariates We test MALTS on mixed data using the same the following data generative process: x c N (µ, Σ), µ = [1, 1.., 1] pc 1, Σ = 0.5 I(p c p c ) x d1,...,dpd iid Bernoulli(1/2) We consider p c = 10 continuous covariates where k c = 4 are associated with the outcome. We also consider p d = 45 discrete covariates where k c = 5 are associated with the outcome. Letting x represent the 9 important covariates and letting k = k c + k d = 9 we generate the outcomes y according to the following quadratic model: k k k y = β i x i + T α i x i + T x i x j where i=1 i=1 i,j=1,j>i p(a i = 1) = p(a i = 1) = 1 2, β i = N (a i 10, 1), α i N (1, 0.5). 9
10 (a) MALTS (b) Genmatch (c) Nearest Neighbor Matching (d) Difference of two BARTs (e) Causal Forest (f) Difference of two RF Figure 5: Another view of performance of different methods, more detailed than Figure 3 (a). Scatter plots for true value of CATEs vs predicted CATEs for different methods where all the covariates are independent. Figure 7 provides a summary of the errors in estimating CATEs across the different methods. We note that MALTS continues to perform as well as methods that directly model the outcome while outperforming the more interpretable matching methods. Figure 8 provides an in depth view of these approaches. In particular, we note that MALTS is better able to adapt to the unimportant discrete covariates than both CRF and the different of two random forests. 5.3 Real Dataset: Lalonde In this section we study the performance of MALTS on the classical Lalonde dataset. The data describe the National Support Work Demonstration (NSW) temporary employment program and its effect on income level of the participants LaLonde (1986). This dataset is frequently used as a benchmark for the performance of methods for observational causal inference. We employ the male sub-sample from the NSW in our analysis as well as the PSID-2 control sample (of male household heads under age 55 who did not classify themselves as retired in 1975 and who were not working when surveyed in the spring of 1976) Dehejia and Wahba (1999b). The outcome variable for both experimental and observational analyses is earnings in 1978 and the considered variables are age, education, whether a respondent is black, is Hispanic, is married, has a degree, and their earnings in 1975 Dehejia and Wahba (1999a). Previously, it has been demonstrated that almost any adjustment during the analysis of the experimental and observational variants of these data (both by modeling the outcome and by modeling the treatment variable) can lead to extreme bias in the estimate of average treatment effects. Tables 1 and 2 present the average treatment effect estimates based on MALTS and state of the art modeling and matching methods. We note that MALTS (after thresholding for high diameter matched group according to the procedure described in Section 5.1.3) is able to achieve 10
11 (a) MALTS (b) Genmatch (c) Nearest Neighbor Matching (d) Difference of two BARTs (e) Causal Forest (f) Difference of two RF Figure 6: Another view of performance of different methods, more detailed than Figure 3 (b). Scatter plots for true value of CATEs vs predicted CATEs for different methods where covariance matrix Σ = (1 ρ)i +ρ11 T where ρ = 0.2. accurate ATE estimation based on both experimental and observational datasets while not discarding too many individuals. As MALTS is a method for constructing interpretable matched groups (which allow for proper CATE estimation) we present sample output matched groups of MALTS for the observational Lalonde dataset in Table 3. At the top of the table we present the learned stretches for the distance metric (L) and note that matching on age appears to be extremely important, followed by education and income in We present two individuals for whom we want to construct matched groups: Query 1 is an 18 year old individual with no income in We are able to construct a tight matched group for this individual (both in control and in treatment). Query 2 is a 17-year-old high income individual, which is an extremely unlikely scenario, leading to a matched group with a very large diameter, which should probably not be used during analysis. 11
12 Figure 7: MALTS performs well on mixed-data. Violin plots of CATE Absolute Error on test set for variety of methods on dataset with mixed and independent covariates. (a) MALTS (b) Genmatch (c) Nearest Neighbor Matching (d) Difference of two BARTs (e) Causal Forest (f) Difference of two RF Figure 8: Scatter plots for true value of CATEs vs predicted CATEs for different methods where all the covariates are independent 12
13 Table 1: Predicted ATE for different methods on full Lalonde s experimental dataset. MALTS uses e 0.15 diameter(matched group) as the weights for each matched group in weighted estimation of ATE. In this setting NN matching is done with replacement. Method ATE Estimate Estimation Bias (%) Number of units matched Truth MALTS (Weighted) CRF BART Genmatch Nearest Neighbor Table 2: Predicted ATE for different methods on Lalonde s observation PSID-2 control dataset and experimental treatment dataset. MALTS uses e diameter(matched group) as the weights for each matched group in weighted estimation of ATE. In this setting NN matching is done with replacement. Method ATE Estimate Estimation Bias (%) Number of units matched Truth MALTS (Weighted) CRF BART Genmatch Nearest Neighbor Figure 9: Scatter Plot showing variation of CATEs with reciprocal of diameter (the surrogate for quality of matched groups) for Lalonde observational PSID-2 dataset. The red line shows an abrupt change in variance of predicted CATEs. The red line can also be used as a threshold to prune bad quality matches to estimate ATE. 13
14 Table 3: Example of Matched Groups on Lalonde Experimental treatment and Observational control Datasets for two example query points drawn from same datasets. Query 1 represents a high quality (low diameter) matched group while Query 2 represents a poor quality (high diameter) matched group that should be discarded during analysis. Age Education Black Hispanic Married Nodegree re75 L Age Education Black Hispanic Married Nodegree re75 re78 T Query Mean Query Mean
15 6 Conclusion and Discussion This paper introduced MALTS, a new method for causal inference and treatment effect estimation. MALTS learns a metric on the covariate space that puts more weight on important covariates and so produces high quality matched groups for analysis. MALTS is able to deal with irrelevant covariates by downweighting their importance in the weighted nearest neighbor algorithm that produces the matches. Unlike other competitive black-box methods including BART, Causal Forest or difference of two random forests, MALTS produces interpretable matched groups and also returns the stretch matrix highlighting relevant covariates. A natural extension to the introduced MALTS framework is to use neural networks or support vector machines to learn a flexible distance metric in a latent space, thus allowing us to match on images and text documents. This setup can be extended to studying treatment effect in texts or images. The MALTS framework can further be extended to deal with missing covariates, and can be adapted to instrumental variables, which is an ongoing effort. References Athey, S., Eckles, D., and Imbens, G. W. (2015). Exact p-values for network interference. Technical report, National Bureau of Economic Research. Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. K. (2016). Double machine learning for treatment and causal parameters. Technical report, cemmap working paper, Centre for Microdata Methods and Practice. Chipman, H. A., George, E. I., and Mcculloch, R. E. (2010). Bart: Bayesian additive regression trees. Annals of Applied Statistics, pages Cochran, W. G. and Rubin, D. B. (1973). Controlling bias in observational studies: A review. Sankhyā: The Indian Journal of Statistics, Series A, pages Dehejia, R. and Wahba, S. (1999a). Lalonde data. [Online; accessed ]. Dehejia, R. H. and Wahba, S. (1999b). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448): Diamond, A. and Sekhon, J. S. (2013). Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. The Review of Economics and Statistics, 95(3): Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, 189(1):1 23. Hahn, P. R., Murray, J. S., and Carvalho, C. M. (2017). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1): Iacus, S. M., King, G., and Porro, G. (2012). Causal inference without balance checking: Coarsened exact matching. Political analysis, 20(1):1 24. Jones, E., Oliphant, T., Peterson, P., et al. (2001 ). SciPy: Open source scientific tools for Python. [Online; accessed ]. Kallus, N. (2017). A Framework for Optimal Matching for Causal Inference. In Singh, A. and Zhu, J., editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages , Fort Lauderdale, FL, USA. PMLR. 15
16 Keele, L. and Zubizarreta, J. R. (2014). Optimal multilevel matching in clustered observational studies: A case study of the school voucher system in chile. arxiv preprint arxiv: LaLonde, R. J. (1986). Evaluating the Econometric Evaluations of Training Programs with Experimental Data. American Economic Review, 76(4): Noor-E-Alam, M. and Rudin, C. (2015a). experiments. Working paper. Robust nonparametric testing for causal inference in natural Noor-E-Alam, M. and Rudin, C. (2015b). Robust testing for causal inference in natural experiments. Working paper. Powell, M. J. D. (1994). A Direct Search Optimization Method That Models the Objective and Constraint Functions by Linear Interpolation, pages Springer Netherlands, Dordrecht. Resa, M. and Zubizarreta, J. R. (2016). balance. Statistics in Medicine. Evaluation of subset matching methods and forms of covariate Rosenbaum, P. R. (2016). Imposing minimax and quantile constraints on optimal matching in observational studies. Journal of Computational and Graphical Statistics, (just-accepted). Ross, M. E., Kreider, A. R., Huang, Y.-S., Matone, M., Rubin, D. M., and Localio, A. R. (2015). Propensity score methods for analyzing observational data like randomized experiments: challenges and solutions for rare outcomes and exposures. American journal of epidemiology, 181(12): Rubin, D. B. (1973a). Matching to remove bias in observational studies. Biometrics, pages Rubin, D. B. (1973b). The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics, pages Rubin, D. B. (1976). Multivariate matching methods that are equal percent bias reducing, i: Some examples. Biometrics, pages Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100: Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statist. Sci., 25(1):1 21. Wang, T., Roy, S., Rudin, C., and Volfovsky, A. (2017). Flame: A fast large-scale almost matching exactly approach to causal inference. arxiv preprint arxiv: Zubizarreta, J. R. (2012). Using mixed integer programming for matching in an observational study of kidney failure after surgery. Journal of the American Statistical Association, 107(500): Zubizarreta, J. R., Paredes, R. D., and Rosenbaum, P. R. (2014). Matching for balance, pairing for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit high schools in chile. The Annals of Applied Statistics, 8(1):
Achieving Optimal Covariate Balance Under General Treatment Regimes
Achieving Under General Treatment Regimes Marc Ratkovic Princeton University May 24, 2012 Motivation For many questions of interest in the social sciences, experiments are not possible Possible bias in
More informationBayesian regression tree models for causal inference: regularization, confounding and heterogeneity
Bayesian regression tree models for causal inference: regularization, confounding and heterogeneity P. Richard Hahn, Jared Murray, and Carlos Carvalho June 22, 2017 The problem setting We want to estimate
More informationFlexible Estimation of Treatment Effect Parameters
Flexible Estimation of Treatment Effect Parameters Thomas MaCurdy a and Xiaohong Chen b and Han Hong c Introduction Many empirical studies of program evaluations are complicated by the presence of both
More informationSelection on Observables: Propensity Score Matching.
Selection on Observables: Propensity Score Matching. Department of Economics and Management Irene Brunetti ireneb@ec.unipi.it 24/10/2017 I. Brunetti Labour Economics in an European Perspective 24/10/2017
More informationBayesian causal forests: dealing with regularization induced confounding and shrinking towards homogeneous effects
Bayesian causal forests: dealing with regularization induced confounding and shrinking towards homogeneous effects P. Richard Hahn, Jared Murray, and Carlos Carvalho July 29, 2018 Regularization induced
More informationCausal Inference Lecture Notes: Causal Inference with Repeated Measures in Observational Studies
Causal Inference Lecture Notes: Causal Inference with Repeated Measures in Observational Studies Kosuke Imai Department of Politics Princeton University November 13, 2013 So far, we have essentially assumed
More informationGov 2002: 5. Matching
Gov 2002: 5. Matching Matthew Blackwell October 1, 2015 Where are we? Where are we going? Discussed randomized experiments, started talking about observational data. Last week: no unmeasured confounders
More informationWhat s New in Econometrics. Lecture 1
What s New in Econometrics Lecture 1 Estimation of Average Treatment Effects Under Unconfoundedness Guido Imbens NBER Summer Institute, 2007 Outline 1. Introduction 2. Potential Outcomes 3. Estimands and
More informationESTIMATION OF TREATMENT EFFECTS VIA MATCHING
ESTIMATION OF TREATMENT EFFECTS VIA MATCHING AAEC 56 INSTRUCTOR: KLAUS MOELTNER Textbooks: R scripts: Wooldridge (00), Ch.; Greene (0), Ch.9; Angrist and Pischke (00), Ch. 3 mod5s3 General Approach The
More informationDOCUMENTS DE TRAVAIL CEMOI / CEMOI WORKING PAPERS. A SAS macro to estimate Average Treatment Effects with Matching Estimators
DOCUMENTS DE TRAVAIL CEMOI / CEMOI WORKING PAPERS A SAS macro to estimate Average Treatment Effects with Matching Estimators Nicolas Moreau 1 http://cemoi.univ-reunion.fr/publications/ Centre d'economie
More informationMatching for Causal Inference Without Balance Checking
Matching for ausal Inference Without Balance hecking Gary King Institute for Quantitative Social Science Harvard University joint work with Stefano M. Iacus (Univ. of Milan) and Giuseppe Porro (Univ. of
More informationIs My Matched Dataset As-If Randomized, More, Or Less? Unifying the Design and. Analysis of Observational Studies
Is My Matched Dataset As-If Randomized, More, Or Less? Unifying the Design and arxiv:1804.08760v1 [stat.me] 23 Apr 2018 Analysis of Observational Studies Zach Branson Department of Statistics, Harvard
More informationCausal Inference Basics
Causal Inference Basics Sam Lendle October 09, 2013 Observed data, question, counterfactuals Observed data: n i.i.d copies of baseline covariates W, treatment A {0, 1}, and outcome Y. O i = (W i, A i,
More informationVariable selection and machine learning methods in causal inference
Variable selection and machine learning methods in causal inference Debashis Ghosh Department of Biostatistics and Informatics Colorado School of Public Health Joint work with Yeying Zhu, University of
More informationDynamics in Social Networks and Causality
Web Science & Technologies University of Koblenz Landau, Germany Dynamics in Social Networks and Causality JProf. Dr. University Koblenz Landau GESIS Leibniz Institute for the Social Sciences Last Time:
More informationfinite-sample optimal estimation and inference on average treatment effects under unconfoundedness
finite-sample optimal estimation and inference on average treatment effects under unconfoundedness Timothy Armstrong (Yale University) Michal Kolesár (Princeton University) September 2017 Introduction
More informationA Measure of Robustness to Misspecification
A Measure of Robustness to Misspecification Susan Athey Guido W. Imbens December 2014 Graduate School of Business, Stanford University, and NBER. Electronic correspondence: athey@stanford.edu. Graduate
More informationarxiv: v4 [stat.ml] 31 Jan 2019
FLAME: A Fast Large-scale Almost Matching Exactly Approach to Causal Inference Tianyu Wang Sudeepa Roy Cynthia Rudin Alexander Volfovsky arxiv:1707.06315v4 [stat.ml] 31 Jan 2019 Abstract A classical problem
More informationBayesian Causal Forests
Bayesian Causal Forests for estimating personalized treatment effects P. Richard Hahn, Jared Murray, and Carlos Carvalho December 8, 2016 Overview We consider observational data, assuming conditional unconfoundedness,
More informationA Minimax Surrogate Loss Approach to Causal Inference
A Minimax Surrogate Loss Approach to Causal Inference Siong Thye Goh MIT Cynthia Rudin Duke University February 10, 2018 Abstract We present a new machine learning approach to estimate personalized treatment
More informationPropensity Score Analysis with Hierarchical Data
Propensity Score Analysis with Hierarchical Data Fan Li Alan Zaslavsky Mary Beth Landrum Department of Health Care Policy Harvard Medical School May 19, 2008 Introduction Population-based observational
More informationAn Introduction to Causal Mediation Analysis. Xu Qin University of Chicago Presented at the Central Iowa R User Group Meetup Aug 10, 2016
An Introduction to Causal Mediation Analysis Xu Qin University of Chicago Presented at the Central Iowa R User Group Meetup Aug 10, 2016 1 Causality In the applications of statistics, many central questions
More informationPropensity Score Matching
Methods James H. Steiger Department of Psychology and Human Development Vanderbilt University Regression Modeling, 2009 Methods 1 Introduction 2 3 4 Introduction Why Match? 5 Definition Methods and In
More informationPropensity Score Weighting with Multilevel Data
Propensity Score Weighting with Multilevel Data Fan Li Department of Statistical Science Duke University October 25, 2012 Joint work with Alan Zaslavsky and Mary Beth Landrum Introduction In comparative
More informationImbens/Wooldridge, IRP Lecture Notes 2, August 08 1
Imbens/Wooldridge, IRP Lecture Notes 2, August 08 IRP Lectures Madison, WI, August 2008 Lecture 2, Monday, Aug 4th, 0.00-.00am Estimation of Average Treatment Effects Under Unconfoundedness, Part II. Introduction
More informationStatistical Models for Causal Analysis
Statistical Models for Causal Analysis Teppei Yamamoto Keio University Introduction to Causal Inference Spring 2016 Three Modes of Statistical Inference 1. Descriptive Inference: summarizing and exploring
More informationIntroduction to Machine Learning Midterm Exam
10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but
More informationCausal inference in multilevel data structures:
Causal inference in multilevel data structures: Discussion of papers by Li and Imai Jennifer Hill May 19 th, 2008 Li paper Strengths Area that needs attention! With regard to propensity score strategies
More informationControlling for overlap in matching
Working Papers No. 10/2013 (95) PAWEŁ STRAWIŃSKI Controlling for overlap in matching Warsaw 2013 Controlling for overlap in matching PAWEŁ STRAWIŃSKI Faculty of Economic Sciences, University of Warsaw
More informationAn Introduction to Causal Analysis on Observational Data using Propensity Scores
An Introduction to Causal Analysis on Observational Data using Propensity Scores Margie Rosenberg*, PhD, FSA Brian Hartman**, PhD, ASA Shannon Lane* *University of Wisconsin Madison **University of Connecticut
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationEcon 673: Microeconometrics Chapter 12: Estimating Treatment Effects. The Problem
Econ 673: Microeconometrics Chapter 12: Estimating Treatment Effects The Problem Analysts are frequently interested in measuring the impact of a treatment on individual behavior; e.g., the impact of job
More informationstudies, situations (like an experiment) in which a group of units is exposed to a
1. Introduction An important problem of causal inference is how to estimate treatment effects in observational studies, situations (like an experiment) in which a group of units is exposed to a well-defined
More informationBackground of Matching
Background of Matching Matching is a method that has gained increased popularity for the assessment of causal inference. This method has often been used in the field of medicine, economics, political science,
More informationSection 10: Inverse propensity score weighting (IPSW)
Section 10: Inverse propensity score weighting (IPSW) Fall 2014 1/23 Inverse Propensity Score Weighting (IPSW) Until now we discussed matching on the P-score, a different approach is to re-weight the observations
More informationNBER WORKING PAPER SERIES A NOTE ON ADAPTING PROPENSITY SCORE MATCHING AND SELECTION MODELS TO CHOICE BASED SAMPLES. James J. Heckman Petra E.
NBER WORKING PAPER SERIES A NOTE ON ADAPTING PROPENSITY SCORE MATCHING AND SELECTION MODELS TO CHOICE BASED SAMPLES James J. Heckman Petra E. Todd Working Paper 15179 http://www.nber.org/papers/w15179
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationCausal Sensitivity Analysis for Decision Trees
Causal Sensitivity Analysis for Decision Trees by Chengbo Li A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics in Computer
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationAchieving Optimal Covariate Balance Under General Treatment Regimes
Achieving Optimal Covariate Balance Under General Treatment Regimes Marc Ratkovic September 21, 2011 Abstract Balancing covariates across treatment levels provides an effective and increasingly popular
More informationInstance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest
More informationGaussian Processes (10/16/13)
STA561: Probabilistic machine learning Gaussian Processes (10/16/13) Lecturer: Barbara Engelhardt Scribes: Changwei Hu, Di Jin, Mengdi Wang 1 Introduction In supervised learning, we observe some inputs
More informationMatching. Quiz 2. Matching. Quiz 2. Exact Matching. Estimand 2/25/14
STA 320 Design and Analysis of Causal Studies Dr. Kari Lock Morgan and Dr. Fan Li Department of Statistical Science Duke University Frequency 0 2 4 6 8 Quiz 2 Histogram of Quiz2 10 12 14 16 18 20 Quiz2
More informationMatching Techniques. Technical Session VI. Manila, December Jed Friedman. Spanish Impact Evaluation. Fund. Region
Impact Evaluation Technical Session VI Matching Techniques Jed Friedman Manila, December 2008 Human Development Network East Asia and the Pacific Region Spanish Impact Evaluation Fund The case of random
More informationComment on Article by Scutari
Bayesian Analysis (2013) 8, Number 3, pp. 543 548 Comment on Article by Scutari Hao Wang Scutari s paper studies properties of the distribution of graphs ppgq. This is an interesting angle because it differs
More informationSince the seminal paper by Rosenbaum and Rubin (1983b) on propensity. Propensity Score Analysis. Concepts and Issues. Chapter 1. Wei Pan Haiyan Bai
Chapter 1 Propensity Score Analysis Concepts and Issues Wei Pan Haiyan Bai Since the seminal paper by Rosenbaum and Rubin (1983b) on propensity score analysis, research using propensity score analysis
More informationESTIMATING AVERAGE TREATMENT EFFECTS: REGRESSION DISCONTINUITY DESIGNS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics
ESTIMATING AVERAGE TREATMENT EFFECTS: REGRESSION DISCONTINUITY DESIGNS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009 1. Introduction 2. The Sharp RD Design 3.
More informationClassification and Pattern Recognition
Classification and Pattern Recognition Léon Bottou NEC Labs America COS 424 2/23/2010 The machine learning mix and match Goals Representation Capacity Control Operational Considerations Computational Considerations
More informationGov 2002: 4. Observational Studies and Confounding
Gov 2002: 4. Observational Studies and Confounding Matthew Blackwell September 10, 2015 Where are we? Where are we going? Last two weeks: randomized experiments. From here on: observational studies. What
More informationBayesian Additive Regression Tree (BART) with application to controlled trail data analysis
Bayesian Additive Regression Tree (BART) with application to controlled trail data analysis Weilan Yang wyang@stat.wisc.edu May. 2015 1 / 20 Background CATE i = E(Y i (Z 1 ) Y i (Z 0 ) X i ) 2 / 20 Background
More informationPROPENSITY SCORE MATCHING. Walter Leite
PROPENSITY SCORE MATCHING Walter Leite 1 EXAMPLE Question: Does having a job that provides or subsidizes child care increate the length that working mothers breastfeed their children? Treatment: Working
More informationCausal Inference with General Treatment Regimes: Generalizing the Propensity Score
Causal Inference with General Treatment Regimes: Generalizing the Propensity Score David van Dyk Department of Statistics, University of California, Irvine vandyk@stat.harvard.edu Joint work with Kosuke
More informationNISS. Technical Report Number 167 June 2007
NISS Estimation of Propensity Scores Using Generalized Additive Models Mi-Ja Woo, Jerome Reiter and Alan F. Karr Technical Report Number 167 June 2007 National Institute of Statistical Sciences 19 T. W.
More informationIntroduction to Machine Learning Midterm Exam Solutions
10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,
More informationTargeted Maximum Likelihood Estimation in Safety Analysis
Targeted Maximum Likelihood Estimation in Safety Analysis Sam Lendle 1 Bruce Fireman 2 Mark van der Laan 1 1 UC Berkeley 2 Kaiser Permanente ISPE Advanced Topics Session, Barcelona, August 2012 1 / 35
More informationAssess Assumptions and Sensitivity Analysis. Fan Li March 26, 2014
Assess Assumptions and Sensitivity Analysis Fan Li March 26, 2014 Two Key Assumptions 1. Overlap: 0
More informationQuantitative Economics for the Evaluation of the European Policy
Quantitative Economics for the Evaluation of the European Policy Dipartimento di Economia e Management Irene Brunetti Davide Fiaschi Angela Parenti 1 25th of September, 2017 1 ireneb@ec.unipi.it, davide.fiaschi@unipi.it,
More informationCOMS 4771 Introduction to Machine Learning. Nakul Verma
COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW1 due next lecture Project details are available decide on the group and topic by Thursday Last time Generative vs. Discriminative
More informationImplementing Matching Estimators for. Average Treatment Effects in STATA
Implementing Matching Estimators for Average Treatment Effects in STATA Guido W. Imbens - Harvard University West Coast Stata Users Group meeting, Los Angeles October 26th, 2007 General Motivation Estimation
More informationMinimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data. Jeff Dominitz RAND. and
Minimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data Jeff Dominitz RAND and Charles F. Manski Department of Economics and Institute for Policy Research, Northwestern
More informationEstimating and Using Propensity Score in Presence of Missing Background Data. An Application to Assess the Impact of Childbearing on Wellbeing
Estimating and Using Propensity Score in Presence of Missing Background Data. An Application to Assess the Impact of Childbearing on Wellbeing Alessandra Mattei Dipartimento di Statistica G. Parenti Università
More informationSelective Inference for Effect Modification
Inference for Modification (Joint work with Dylan Small and Ashkan Ertefaie) Department of Statistics, University of Pennsylvania May 24, ACIC 2017 Manuscript and slides are available at http://www-stat.wharton.upenn.edu/~qyzhao/.
More informationImplementing Matching Estimators for. Average Treatment Effects in STATA. Guido W. Imbens - Harvard University Stata User Group Meeting, Boston
Implementing Matching Estimators for Average Treatment Effects in STATA Guido W. Imbens - Harvard University Stata User Group Meeting, Boston July 26th, 2006 General Motivation Estimation of average effect
More informationMachine Learning for Signal Processing Bayes Classification and Regression
Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For
More informationCausal Inference with Big Data Sets
Causal Inference with Big Data Sets Marcelo Coca Perraillon University of Colorado AMC November 2016 1 / 1 Outlone Outline Big data Causal inference in economics and statistics Regression discontinuity
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationGenetic Matching for Estimating Causal Effects:
Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies Alexis Diamond Jasjeet S. Sekhon Forthcoming, Review of Economics and
More informationUnderstanding Generalization Error: Bounds and Decompositions
CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationPropensity Score Methods for Causal Inference
John Pura BIOS790 October 2, 2015 Causal inference Philosophical problem, statistical solution Important in various disciplines (e.g. Koch s postulates, Bradford Hill criteria, Granger causality) Good
More informationCausal Inference by Minimizing the Dual Norm of Bias. Nathan Kallus. Cornell University and Cornell Tech
Causal Inference by Minimizing the Dual Norm of Bias Nathan Kallus Cornell University and Cornell Tech www.nathankallus.com Matching Zoo It s a zoo of matching estimators for causal effects: PSM, NN, CM,
More informationMoving the Goalposts: Addressing Limited Overlap in Estimation of Average Treatment Effects by Changing the Estimand
Moving the Goalposts: Addressing Limited Overlap in Estimation of Average Treatment Effects by Changing the Estimand Richard K. Crump V. Joseph Hotz Guido W. Imbens Oscar Mitnik First Draft: July 2004
More informationcausal inference at hulu
causal inference at hulu Allen Tran July 17, 2016 Hulu Introduction Most interesting business questions at Hulu are causal Business: what would happen if we did x instead of y? dropped prices for risky
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate
More informationPotential Outcomes Model (POM)
Potential Outcomes Model (POM) Relationship Between Counterfactual States Causality Empirical Strategies in Labor Economics, Angrist Krueger (1999): The most challenging empirical questions in economics
More informationAlgorithm-Independent Learning Issues
Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning
More informationThe propensity score with continuous treatments
7 The propensity score with continuous treatments Keisuke Hirano and Guido W. Imbens 1 7.1 Introduction Much of the work on propensity score analysis has focused on the case in which the treatment is binary.
More informationWhen Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data?
When Should We Use Linear Fixed Effects Regression Models for Causal Inference with Longitudinal Data? Kosuke Imai Department of Politics Center for Statistics and Machine Learning Princeton University
More informationSIMULATION-BASED SENSITIVITY ANALYSIS FOR MATCHING ESTIMATORS
SIMULATION-BASED SENSITIVITY ANALYSIS FOR MATCHING ESTIMATORS TOMMASO NANNICINI universidad carlos iii de madrid UK Stata Users Group Meeting London, September 10, 2007 CONTENT Presentation of a Stata
More informationMeasuring Social Influence Without Bias
Measuring Social Influence Without Bias Annie Franco Bobbie NJ Macdonald December 9, 2015 The Problem CS224W: Final Paper How well can statistical models disentangle the effects of social influence from
More informationLearning from Examples
Learning from Examples Data fitting Decision trees Cross validation Computational learning theory Linear classifiers Neural networks Nonparametric methods: nearest neighbor Support vector machines Ensemble
More informationImbens/Wooldridge, Lecture Notes 1, Summer 07 1
Imbens/Wooldridge, Lecture Notes 1, Summer 07 1 What s New in Econometrics NBER, Summer 2007 Lecture 1, Monday, July 30th, 9.00-10.30am Estimation of Average Treatment Effects Under Unconfoundedness 1.
More informationSparse Linear Models (10/7/13)
STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine
More informationMatching. Stephen Pettigrew. April 15, Stephen Pettigrew Matching April 15, / 67
Matching Stephen Pettigrew April 15, 2015 Stephen Pettigrew Matching April 15, 2015 1 / 67 Outline 1 Logistics 2 Basics of matching 3 Balance Metrics 4 Matching in R 5 The sample size-imbalance frontier
More informationEstimation of the Conditional Variance in Paired Experiments
Estimation of the Conditional Variance in Paired Experiments Alberto Abadie & Guido W. Imbens Harvard University and BER June 008 Abstract In paired randomized experiments units are grouped in pairs, often
More informationLinear Model Selection and Regularization
Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In
More informationMachine Learning (CS 567) Lecture 5
Machine Learning (CS 567) Lecture 5 Time: T-Th 5:00pm - 6:20pm Location: GFS 118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationMining Classification Knowledge
Mining Classification Knowledge Remarks on NonSymbolic Methods JERZY STEFANOWSKI Institute of Computing Sciences, Poznań University of Technology COST Doctoral School, Troina 2008 Outline 1. Bayesian classification
More informationCompSci Understanding Data: Theory and Applications
CompSci 590.6 Understanding Data: Theory and Applications Lecture 17 Causality in Statistics Instructor: Sudeepa Roy Email: sudeepa@cs.duke.edu Fall 2015 1 Today s Reading Rubin Journal of the American
More informationMachine Learning for Signal Processing Bayes Classification
Machine Learning for Signal Processing Bayes Classification Class 16. 24 Oct 2017 Instructor: Bhiksha Raj - Abelino Jimenez 11755/18797 1 Recap: KNN A very effective and simple way of performing classification
More informationDiversity Regularization of Latent Variable Models: Theory, Algorithm and Applications
Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications Pengtao Xie, Machine Learning Department, Carnegie Mellon University 1. Background Latent Variable Models (LVMs) are
More informationarxiv: v1 [stat.me] 15 May 2011
Working Paper Propensity Score Analysis with Matching Weights Liang Li, Ph.D. arxiv:1105.2917v1 [stat.me] 15 May 2011 Associate Staff of Biostatistics Department of Quantitative Health Sciences, Cleveland
More informationApplied Microeconometrics (L5): Panel Data-Basics
Applied Microeconometrics (L5): Panel Data-Basics Nicholas Giannakopoulos University of Patras Department of Economics ngias@upatras.gr November 10, 2015 Nicholas Giannakopoulos (UPatras) MSc Applied Economics
More informationCombining multiple observational data sources to estimate causal eects
Department of Statistics, North Carolina State University Combining multiple observational data sources to estimate causal eects Shu Yang* syang24@ncsuedu Joint work with Peng Ding UC Berkeley May 23,
More informationBayesian Inference for Sequential Treatments under Latent Sequential Ignorability. June 19, 2017
Bayesian Inference for Sequential Treatments under Latent Sequential Ignorability Alessandra Mattei, Federico Ricciardi and Fabrizia Mealli Department of Statistics, Computer Science, Applications, University
More informationA Note on Adapting Propensity Score Matching and Selection Models to Choice Based Samples
DISCUSSION PAPER SERIES IZA DP No. 4304 A Note on Adapting Propensity Score Matching and Selection Models to Choice Based Samples James J. Heckman Petra E. Todd July 2009 Forschungsinstitut zur Zukunft
More informationEXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING
EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical
More informationMachine Learning for OR & FE
Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com
More informationPredicting the Treatment Status
Predicting the Treatment Status Nikolay Doudchenko 1 Introduction Many studies in social sciences deal with treatment effect models. 1 Usually there is a treatment variable which determines whether a particular
More information