Achieving Optimal Covariate Balance Under General Treatment Regimes

Achieving Optimal Covariate Balance Under General Treatment Regimes. Marc Ratkovic, Princeton University. May 24, 2012.

Motivation

- For many questions of interest in the social sciences, experiments are not possible
  - Possible bias in effect estimates
- Regression adjustment or inverse weighting can be used to adjust for selection bias
  - Model dependence
- Matching reduces bias and model dependence by identifying a set of untreated observations that are similar to the treated observations

Problems with Existing Matching Methods

Existing matching methods, such as propensity score matching, genetic matching, and Coarsened Exact Matching:
- rely on many user inputs
- are sensitive to these choices
- have no formal statistical properties
- can only handle a binary treatment

Benefits of the Proposed Method

The proposed method:
- is fully automated
- makes no functional form assumptions
- identifies the largest balanced subset
- can also accommodate continuous treatments

The Setup

Notation:
- Treatment: T_i
  - Binary treatment: T_i ∈ {0, 1}
  - Continuous treatment: T_i ∈ (a, b)
- Potential outcome: Y_i(t)
- Pre-treatment covariates: X_i
- IID observations (Y_i(T_i), T_i, X_i) observed

Assumptions:
- No interference among units
- Treatment occurs with uncertainty
- No omitted variables

Assumptions and Estimands

Goal of matching: identify a subset of the data such that the covariates are balanced, T_i ⊥ X_i.

Common estimands identified on a balanced subset of the data:
- Average Treatment Effect: E[Y_i(1) − Y_i(0)]
- Average Treatment Effect on the Treated: E[Y_i(1) − Y_i(0) | T_i = 1]
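As a concrete illustration of these two estimands (a minimal simulation, not from the talk; the data-generating process is invented), both potential outcomes are observable below, so the ATE and ATT can be computed directly:

```python
# Minimal simulation illustrating the two estimands on synthetic data,
# where (unlike in practice) both potential outcomes are observable.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=n)                       # a single confounder
T = rng.binomial(1, 1 / (1 + np.exp(-X)))    # treatment probability rises in X
Y0 = X + rng.normal(size=n)                  # potential outcome under control
Y1 = Y0 + 2.0 + 0.5 * X                      # treatment effect is 2 + 0.5 * X

ate = np.mean(Y1 - Y0)                       # E[Y_i(1) - Y_i(0)]
att = np.mean(Y1[T == 1] - Y0[T == 1])       # E[Y_i(1) - Y_i(0) | T_i = 1]
print(f"ATE = {ate:.2f}, ATT = {att:.2f}")   # ATT > ATE here: treated have larger X
```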

Basic Insight of the Proposed Method

The proposed method formulates a Support Vector Machine (SVM) that identifies a balanced subset of the data. The logic proceeds in three steps:
1. The optimality condition for an SVM sets an inner product between the treatment level and a covariate to zero.
2. Centering the treatment and covariate transforms this inner product into balance-in-means or zero covariance.
3. Balancing along a nonparametric basis extends the mean/covariance result to joint independence.

A Simple Example: The Binary Matching SVM

Assume an observed covariate vector X_i and target function X_i'β.

Center X_i on the treated observations:

    X̃_i = X_i − ( Σ_i X_i 1(T_i = 1) ) / ( Σ_i 1(T_i = 1) )

Transform T_i from {0, 1} to {−1, 1}:

    T*_i = 2 T_i − 1
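A minimal sketch of these two pre-processing steps, assuming numpy arrays; the helper name is mine, not the paper's:

```python
# Sketch of the pre-processing: center X on the treated group and
# map the binary treatment from {0, 1} to {-1, +1}.
import numpy as np

def preprocess(X, T):
    """X: (n, p) covariate matrix; T: (n,) binary treatment in {0, 1}."""
    treated_mean = X[T == 1].mean(axis=0)  # sum_i X_i 1(T_i=1) / sum_i 1(T_i=1)
    X_tilde = X - treated_mean             # centered covariates
    T_star = 2 * T - 1                     # {0, 1} -> {-1, +1}
    return X_tilde, T_star
```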

A Simple Example: The Binary Matching SVM

Define the hinge loss (z)_+ = max(z, 0).

Loss function:

    L(β) = Σ_i ( 1 − T*_i X̃_i'β )_+    subject to    X̃_i'β · 1(T_i = 1) < 1

Easy- and hard-to-classify cases:
- T*_i = 1, X̃_i'β = 2: (1 − 1·2)_+ = (−1)_+ = 0 ⇒ easy to classify
- T*_i = −1, X̃_i'β = −0.5: (1 − (−1)·(−0.5))_+ = (0.5)_+ = 0.5 ⇒ hard to classify

The constraint keeps the loss for all treated observations non-zero, which identifies the ATT.
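The two cases above can be verified numerically; a small sketch of the hinge loss, with illustrative function names:

```python
# Hinge loss (z)_+ = max(z, 0) and the per-observation SVM loss.
import numpy as np

def hinge(z):
    return np.maximum(z, 0.0)

def loss_i(t_star, score):
    """Per-observation loss (1 - T*_i X~_i'beta)_+, with score = X~_i'beta."""
    return hinge(1.0 - t_star * score)

print(loss_i(+1, 2.0))    # (1 - 1*2)_+        = 0.0: easy to classify
print(loss_i(-1, -0.5))   # (1 - (-1)(-0.5))_+ = 0.5: hard to classify
```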

Geometric Intuition of the Proposed Method

Properly classified cases outside the margin are easy to classify. Cases in the margin, or improperly classified cases, have a treatment assignment estimated with some uncertainty.

[Figure: scatterplot of treated (X) and control (O) observations, with the separating hyperplane and the margin marked.]

A Simple Example: The Binary Matching SVM

Define M = { i : 1 − T*_i X̃_i'β > 0 }.

Taking and expanding the first-order condition:

    ∂/∂β Σ_i ( 1 − T*_i X̃_i'β )_+ = − Σ_i T*_i X̃_i 1(i ∈ M) = 0
    ⇒ Σ_i X̃_i 1(T_i = 0, i ∈ M) = Σ_i X̃_i 1(T_i = 1)
    ⇒ Σ_i X̃_i 1(T_i = 0, i ∈ M) = Σ_i X̃_i 1(T_i = 1) = 0,

where the last equality holds since X̃_i is centered on T_i = 1.

The Law of Large Numbers then gives

    E(X_i | T_i = 1) = E(X_i | T_i = 0, i ∈ M).
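A sketch of the balance check this derivation implies, assuming a fitted coefficient vector beta and the centered quantities defined earlier (helper names are mine): at the optimum, the control units in M should average to zero on the centered scale.

```python
# Check the first-order balance condition on the margin set M.
import numpy as np

def margin_set(X_tilde, T_star, beta):
    """M = {i : 1 - T*_i X~_i'beta > 0}, returned as a boolean mask."""
    return (1.0 - T_star * (X_tilde @ beta)) > 0

def control_mean_in_M(X_tilde, T, M):
    """Mean of the centered covariates over control units in M;
    the derivation says this should be ~0 at the SVM optimum."""
    return X_tilde[(T == 0) & M].mean(axis=0)
```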

The Binary Treatment SVM

Balance-in-means is not balance in distribution. To achieve joint independence (Proposition 1):
- Change the target function from X_i'β to a nonparametric functional η(X_i)
- Add a regularization term that trades off covariate imbalance against model complexity
- Observations in M are balanced

The proof follows the linear case nearly exactly, except in a high-dimensional space.
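The proposed estimator imposes the ATT constraint and its own regularization, so the snippet below is only a rough analogue, not the paper's method: a standard soft-margin SVM with an RBF kernel stands in for the nonparametric target η(X), and the in-margin observations play the role of M.

```python
# Rough analogue using scikit-learn's soft-margin SVM with an RBF kernel.
# This is NOT the paper's estimator: it omits the ATT constraint and uses
# off-the-shelf regularization.
import numpy as np
from sklearn.svm import SVC

def candidate_balanced_subset(X, T, gamma=1.0, C=1.0):
    """X: (n, p) covariates; T: (n,) binary treatment in {0, 1}.
    Returns a boolean mask of in-margin observations, analogous to M."""
    fit = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, T)
    f = fit.decision_function(X)  # fitted eta(X_i)
    t_star = 2 * T - 1
    return (1.0 - t_star * f) > 0
```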

Extension to a Continuous Treatment

The extension follows nearly exactly from the binary case. Using identical reasoning, I show that, for observations in M, the treatment is uncorrelated with each covariate. Extending the target to a nonparametric function of X_i turns this uncorrelatedness into joint independence (Proposition 2).
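A sketch of the corresponding diagnostic for the continuous case (names are illustrative): within the matched subset M, the treatment should be roughly uncorrelated with every covariate.

```python
# Largest absolute treatment-covariate correlation within the matched subset.
import numpy as np

def max_abs_correlation(T, X, M):
    """T: (n,) continuous treatment; X: (n, p) covariates; M: boolean mask."""
    Tm, Xm = T[M], X[M]
    cors = [np.corrcoef(Tm, Xm[:, j])[0, 1] for j in range(Xm.shape[1])]
    return np.max(np.abs(cors))  # should be near zero after matching
```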

Properties and Comparisons

- Selects the largest balanced subset
- In a simple randomized experiment, the sample size is twice the expected misclassification loss; the number of balanced observations approaches twice the expected misclassification loss asymptotically
- Answers the question of how many matches to use
- Tuning parameters are selected through the GACV criterion
- Identifies observations that appear to follow a simple randomization
- Most useful when the researcher does not know which variables to match finely, exactly, in mean, etc.

Returning the Experimental Result from Experimental Data

The 1975-1978 National Supported Work study (LaLonde 1986):
- Treatment: job training, close management, peer support
- Recipients: welfare recipients, ex-addicts, young school dropouts, and ex-offenders
- n = 445: 185 treated, 260 control
- PSID data used for matching, n = 2,490
- X: age, years of education, race, marital status, high school degree, 1974 earnings, 1975 earnings, zero earnings in 1974, zero earnings in 1975

Analyses

Competitors:
- Logistic propensity score matching (Ho et al. 2011)
- Genetic matching (Sekhon 2011)
- Coarsened Exact Matching (Iacus et al. 2011); CEM estimates the ATT through extrapolation
- BART + optimal matching (Hill et al. 2011; Hansen 2004)

Outcomes:
- 1978 earnings
- 1978 earnings − 1975 earnings

Datasets:
- Experimental treated and untreated observations
- Experimental treated observations; observational untreated observations

Experimental Results

[Figure: densities of treatment effect estimates across model specifications using NSW experimental data, for CEM, BART, propensity matching, GenMatch, and the SVM method, against the experimental benchmark. Left panel: 1978 earnings; right panel: 1978 − 1975 earnings. x-axis: estimated effect, roughly $1,200 to $2,200.]

Observational Results

[Figure: densities of treatment effect estimates across model specifications, with untreated observations taken from the observational PSID data, for CEM, BART, propensity matching, GenMatch, and the SVM method, against the experimental benchmark. Left panel: 1978 earnings; right panel: 1978 − 1975 earnings. x-axis: estimated effect, roughly −$2,000 to $2,000.]

Smoking and Medical Expenditures

The 1987 National Medical Expenditure Survey (Johnson et al. 2003; Imai and van Dyk 2004):
- Treatment: log(pack-years), where pack-years = packs a day × number of years smoking
- Respondents: representative sample of the US population
  - n = 9,708 smokers, to be balanced
  - n = 9,804 non-smokers, reference group
- Outcome: medical expenditure, in dollars
- X: age at survey, age when started smoking, gender, race, education, marital status, census region, poverty status, seat-belt use

Assessing Balance

[Figure: treatment (logged pack-years) versus key predictors, age when started smoking and age at survey, showing the mean by age for the matched subset (black) against the complete dataset (gray).]

Assessing Balance

[Figure: quantile plot of coefficient p-values from regressing the treatment on pretreatment covariates, versus a Uniform(0, 1) distribution. Weighted data: matched subset (R² = 0.0039, p = 0.7899), subclassification (R² = 0.0295, p = 0.0004), full dataset (R² = 0.3413, p < 2.2×10⁻¹⁶). Unweighted data: matched subset (R² = 0.0047, p = 0.5899), subclassification (R² = 0.0298, p = 0.0004), full dataset (R² = 0.3401, p < 2.2×10⁻¹⁶).]
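A sketch of the diagnostic behind this plot, assuming statsmodels and illustrative variable names: regress the treatment on all pretreatment covariates and compare the sorted coefficient p-values to a Uniform(0, 1) reference.

```python
# Balance diagnostic: coefficient p-values from regressing the treatment
# on pretreatment covariates, against a Uniform(0, 1) reference line.
import numpy as np
import statsmodels.api as sm

def balance_pvalues(T, X):
    """T: (n,) treatment (e.g., logged pack-years); X: (n, p) covariates."""
    fit = sm.OLS(T, sm.add_constant(X)).fit()
    pvals = np.sort(np.asarray(fit.pvalues)[1:])  # drop intercept, sort for QQ
    expected = (np.arange(1, len(pvals) + 1) - 0.5) / len(pvals)
    return expected, pvals  # under balance, pvals track the 45-degree line
```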

Estimated Effect

[Figure: medical expenditures relative to non-smokers versus pack-years (0.05 to 200, log scale), showing the conditional mean with a 95% Bayesian confidence interval; estimated effects range from roughly −$1,000 to $1,000.]

Conclusion

The proposed method adapts SVM technology to the matching problem. The method:
- is fully automated
- makes no functional form assumptions
- identifies the largest balanced subset
- can also accommodate continuous treatments